Why isn't variance defined as the difference between every value following each other?


19

This may be a simple question for many, but here it is:

Why isn't variance defined as the difference between every value following each other, instead of the difference to the average of the values?

This would be the more logical choice to me. I'm sure I must be overlooking some disadvantages. Thanks

Edit:

Let me rephrase as clearly as possible. This is what I mean:

  1. Assume you have an ordered set of numbers: 1, 2, 3, 4, 5.
  2. Compute the differences between the values (in absolute value, the successive difference between every pair of subsequent values), without using the mean, and sum them up.
  3. Divide by the number of differences.
  4. (Follow-up: the answer would differ if the numbers were unordered.)

-> What are the disadvantages of this approach compared to the standard formula for variance?
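To make the procedure concrete, here is a minimal sketch in plain Python (the function names are made up for illustration):

```python
def successive_diff_spread(values):
    # Steps 2-3 above: average of absolute differences between successive values.
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)

def population_variance(values):
    # Standard variance: mean squared distance from the mean.
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

data = [1, 2, 3, 4, 5]
print(successive_diff_spread(data))             # 1.0
print(successive_diff_spread([2, 5, 1, 4, 3]))  # 2.75 -- same values, different order
print(population_variance(data))                # 2.0  -- unaffected by order
```

The reordered sample already hints at point 4: the proposed measure depends on the ordering, while the variance does not.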


1
I suggest you read about autocorrelation (e.g. stats.stackexchange.com/questions/185521/…).

2
@user2305193 whuber's answer is correct, but his formula uses the squared distances between successive values in an ordering of the data, averaged over all orderings. Neat trick, though; the process of finding the variance the way you indicated, which I tried to implement in my answer, demonstrably doesn't work well. Trying to clear up the confusion.
Greenparker

1
Just for fun, look up the Allan variance.
hobbs

On another thought: since you don't square the differences (and don't take a square root afterwards) but take absolute values, shouldn't this rather be "why isn't this how we calculate the standard deviation" instead of "why isn't this how we calculate the variance"? But I'll let it rest for now.
user2305193

Answers:


27

The most obvious reason is that there is often no time sequence in the values, so if you jumble the data it makes no difference in the information conveyed by the data. If we follow your method, then every time you jumble the data you get a different sample variance.

The more theoretical answer is that the sample variance estimates the true variance of a random variable $X$. The true variance of a random variable $X$ is

$$E\left[(X - E[X])^2\right].$$

Here $E$ denotes the expectation, or "average value". So the definition of the variance is the average squared distance between the variable and its average value. When you look at this definition, there is no "time order" here, since there is no data; it is just an attribute of the random variable.

When you collect iid data from this distribution, you have realizations $x_1, x_2, \ldots, x_n$. The best way to estimate the expectation is to take the sample mean. The key here is that the data are iid, so there is no ordering to the data: the sample $x_1, x_2, \ldots, x_n$ is the same as the sample $x_2, x_5, x_1, x_n, \ldots$.

EDIT

The sample variance measures a specific kind of dispersion in the sample, namely the average distance from the mean. There are other kinds of dispersion, such as the range of the data and the interquartile range.

Even if you sort your values in ascending order, that does not change the characteristics of the sample. The sample (data) you obtain are realizations of a variable. Calculating the sample variance is akin to understanding how much dispersion there is in the variable. For example, if you sample 20 people and measure their heights, those are 20 "realizations" of the random variable $X = $ height of a person. Now the sample variance is supposed to measure the variability in the heights of individuals in general. If you order the data as

$$100, 110, 123, 124, \ldots,$$

that does not change the information in the sample.

Let's look at one more example. Suppose you have 100 observations from a random variable, ordered in this way:

$$1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, \ldots, 100.$$

Then the average successive distance is 1 unit, so by your method the variance would be 1 (or 0.99 if, as you clarify below, you divide by the number of samples rather than the number of differences).

"분산"또는 "분산"을 해석하는 방법은 데이터에 어떤 범위의 값이 있는지 이해하는 것입니다. 이 경우 .99 단위의 범위를 얻게되며 물론 변동을 잘 나타내지 않습니다.

If instead of taking the average you just sum the successive differences, your variance would be 99. Of course 99 does not represent the variability in the sample either, because 99 gives you the range of the data, not a sense of variability.
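A quick numerical check of this example (a sketch assuming NumPy is available):

```python
import numpy as np

x = np.arange(1, 101)              # the ordered sample 1, 2, ..., 100
succ = np.abs(np.diff(x))          # the 99 successive differences, all equal to 1

print(succ.mean())                 # 1.0  -- dividing by the number of differences
print(succ.sum() / len(x))         # 0.99 -- dividing by the number of samples
print(succ.sum())                  # 99   -- the range of the data
print(x.var(ddof=1))               # 841.67 -- the usual sample variance
```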


1
With the last paragraph you got me, haha. Thanks for this flabbergasting answer; I wish I had enough rep to upvote it.
user2305193

Follow-up: what I actually meant (yes, sorry, I only realized the right question after reading your answer) was summing the differences and dividing by the number of samples. In your last example that would be 99/100. Could you elaborate on that, for perfect flabbergasted-ness?
user2305193

@user2305193 Right, I said 1 unit on average, which was wrong. It should have been .99 units. I've changed it.
Greenparker

More on the 1-100 series: the variance of 1-100 is 841.7 and the standard deviation is 29.01. So indeed a different result.
user2305193

31

It is defined that way!

Here's the algebra. Let the values be $x = (x_1, x_2, \ldots, x_n)$. Let $F$ be the empirical distribution function of these values (which means each $x_i$ contributes a probability mass of $1/n$ at the value $x_i$), and let $X$ and $Y$ be independent random variables with distribution $F$. By basic properties of variance (namely, that it is a quadratic form), the definition of $F$, and the fact that $X$ and $Y$ have the same mean,

$$\begin{aligned}
\operatorname{Var}(x) = \operatorname{Var}(X) &= \tfrac{1}{2}\left(\operatorname{Var}(X) + \operatorname{Var}(Y)\right) = \tfrac{1}{2}\operatorname{Var}(X - Y) \\
&= \tfrac{1}{2}\left(E\left[(X - Y)^2\right] - E[X - Y]^2\right) = E\left[\tfrac{1}{2}(X - Y)^2\right] - 0 \\
&= \frac{1}{n^2}\sum_{i,j}\tfrac{1}{2}\left(x_i - x_j\right)^2.
\end{aligned}$$

This formula does not depend on the way $x$ is ordered: it uses all possible pairs of components, comparing them via half their squared differences. It can, however, be related to an average over all possible orderings (the group $S(n)$ of all $n!$ permutations of the indices $1, 2, \ldots, n$). Namely,

$$\operatorname{Var}(x) = \frac{1}{n^2}\sum_{i,j}\tfrac{1}{2}\left(x_i - x_j\right)^2 = \frac{1}{n!}\sum_{\sigma \in S(n)} \frac{1}{n}\sum_{i=1}^{n-1}\tfrac{1}{2}\left(x_{\sigma(i)} - x_{\sigma(i+1)}\right)^2.$$

The inner summation takes the reordered values $x_{\sigma(1)}, x_{\sigma(2)}, \ldots, x_{\sigma(n)}$ and sums the (half) squared differences between all $n-1$ successive pairs. Dividing by $n$ essentially averages these successive squared differences; it computes what is known as the lag-1 semivariance. The outer summation does this for all possible orderings.


These two equivalent algebraic views of the standard variance formula give new insight into what the variance means. The semivariance is an inverse measure of the serial covariance of a sequence: the covariance is high (and the numbers are positively correlated) when the semivariance is low, and conversely. The variance of an unordered dataset, then, is a kind of average of all possible semivariances obtainable under arbitrary reorderings.
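Both identities are straightforward to verify by brute force. Here is a sketch (assuming NumPy; the data vector is arbitrary, and $n$ must stay small because the last check enumerates all $n!$ orderings):

```python
import numpy as np
from itertools import permutations

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
n = len(x)

# Usual variance with probability mass 1/n at each value (ddof=0).
var_usual = x.var()

# Half the average squared difference over all n^2 ordered pairs (i, j).
var_pairs = 0.5 * ((x[:, None] - x[None, :]) ** 2).mean()

# Average lag-1 semivariance over all n! orderings.
semivariances = []
for perm in permutations(x):
    p = np.array(perm)
    semivariances.append(0.5 * np.sum(np.diff(p) ** 2) / n)
var_orderings = np.mean(semivariances)

print(var_usual, var_pairs, var_orderings)  # all three print 2.56
```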


1
@Mur1lo On the contrary: I believe this derivation is correct. Apply the formula to some data and see!
whuber

1
I think Mur1lo may have been talking not about the correctness of the formula for variance but about apparently passing directly from expectations of random variables to functions of sample quantities.
Glen_b -Reinstate Monica

1
@glen But that's precisely what the empirical distribution function lets us do. That's the entire point of this approach.
whuber

3
Yes, that's clear to me; I was trying to point out where the confusion seemed to lie. Sorry to be vague. Hopefully it's clearer now why it only appears* to be a problem. *(This is why I used the word "apparent" earlier: to emphasize it was just the out-of-context appearance of that step that was likely to be the cause of the confusion.)
Glen_b -Reinstate Monica

2
@Mur1lo The only thing I have done in any of these equations is to apply definitions. There is no passing from expectations to "sample quantities". (In particular, no sample of F has been posited or used.) Thus I am unable to identify what the apparent problem is, nor suggest an alternative explanation. If you could expand on your concern then I might be able to respond.
whuber

11

Just as a complement to the other answers: the variance can be computed from the squared differences between all pairs of terms:

$$\begin{aligned}
\operatorname{Var}(X) &= \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(x_i - x_j\right)^2 \\
&= \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(x_i - \bar{x} - x_j + \bar{x}\right)^2 \\
&= \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left((x_i - \bar{x}) - (x_j - \bar{x})\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
\end{aligned}$$

I think this is the closest to the OP's proposition. Remember that the variance is a measure of dispersion of every observation at once, not only between "neighboring" numbers in the set.


UPDATE

Using your example: $X = \{1, 2, 3, 4, 5\}$. We know the variance is $\operatorname{Var}(X) = 2$.

With your proposed method $\operatorname{Var}(X) = 1$, so we know beforehand that taking the differences between neighbors as the variance doesn't add up. What I meant was taking every possible difference, squaring, then summing:

$$\begin{aligned}
\operatorname{Var}(X) &= \frac{1}{2 \cdot 5^2}\big[(5-1)^2 + (5-2)^2 + (5-3)^2 + (5-4)^2 + (5-5)^2 \\
&\quad + (4-1)^2 + (4-2)^2 + (4-3)^2 + (4-4)^2 + (4-5)^2 \\
&\quad + (3-1)^2 + (3-2)^2 + (3-3)^2 + (3-4)^2 + (3-5)^2 \\
&\quad + (2-1)^2 + (2-2)^2 + (2-3)^2 + (2-4)^2 + (2-5)^2 \\
&\quad + (1-1)^2 + (1-2)^2 + (1-3)^2 + (1-4)^2 + (1-5)^2\big] \\
&= \frac{16 + 9 + 4 + 1 + 0 + 9 + 4 + 1 + 0 + 1 + 4 + 1 + 0 + 1 + 4 + 1 + 0 + 1 + 4 + 9 + 0 + 1 + 4 + 9 + 16}{50} \\
&= \frac{100}{50} = 2
\end{aligned}$$
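The same arithmetic in a few lines (a sketch assuming NumPy):

```python
import numpy as np

x = [1, 2, 3, 4, 5]
n = len(x)
total = sum((xi - xj) ** 2 for xi in x for xj in x)  # the 25 terms written out above
print(total / (2 * n ** 2))  # 100 / 50 = 2.0
print(np.var(x))             # 2.0, the usual (1/n) variance
```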

Now I'm seriously confused guys
user2305193

@user2305193 In your question, did you mean every pairwise difference or did you mean the difference between a value and the next in a sequence? Could you please clarify?
Firebug

2
@Mur1lo no one is though, I have no idea what you're referring to.
Firebug

2
@Mur1lo This is a general question, and I answered it generally. Variance is a computable parameter, which can be estimated from samples. This question isn't about estimation though. Also we are talking about discrete sets, not about continuous distributions.
Firebug

1
You showed how to estimate the variance by its U-statistic, and that's fine. The problem is when you write $\operatorname{Var}(X)$ (upper-case $X$) equal to things involving lower-case $x$: you are mixing the two different notions of a parameter and of an estimator.
Mur1lo

6

Others have answered about the usefulness of variance defined as usual. Anyway, we just have two legitimate definitions of different things: the usual definition of variance, and your definition.

Then the main question is why the first one is called variance, and not yours. That is just a matter of convention. Until 1918 you could have invented anything you wanted and called it "variance", but in 1918 Fisher gave that name to what is still called variance, and if you want to define anything else you will need to find another name for it.

The other question is whether the thing you defined might be useful for anything. Others have pointed out its problems as a measure of dispersion, but it's up to you to find applications for it. Maybe you will find such useful applications that in a century your thing is more famous than the variance.


I know every definition is up to the people deciding on it; I was really looking for help with the up/downsides of each approach. Usually there's a good reason for people converging on a definition, and as I suspected, I didn't see why straight away.
user2305193

1
Fisher introduced variance as a term in 1918 but the idea is older.
Nick Cox

As far as I know, Fisher was the first one to use the name "variance" for variance. That's why I say that before 1918 you could have used "variance" to name anything else you had invented.
Pere

3

@GreenParker's answer is more complete, but an intuitive example might be useful to illustrate the drawback of your approach.

In your question, you seem to assume that the order in which realisations of a random variable appear matters. However, it is easy to think of examples in which it doesn't.

Consider the example of the height of individuals in a population. The order in which individuals are measured is irrelevant to both the mean height in the population and the variance (how spread out those values are around the mean).

Your method would seem odd applied to such a case.


2

Although there are many good answers to this question, I believe some important points were left behind, and since this question came up with a really interesting point, I would like to provide yet another point of view.

Why isn't variance defined as the difference between every value following    
each other instead of the difference to the average of the values?

The first thing to keep in mind is that the variance is a particular kind of parameter, not a certain type of calculation. There is a rigorous mathematical definition of what a parameter is, but for the time being we can think of parameters as mathematical operations on the distribution of a random variable. For example, if $X$ is a random variable with distribution function $F_X$, then its mean $\mu_X$, which is also a parameter, is:

$$\mu_X = \int_{-\infty}^{+\infty} x \, dF_X(x)$$

and the variance of $X$, $\sigma_X^2$, is:

$$\sigma_X^2 = \int_{-\infty}^{+\infty} \left(x - \mu_X\right)^2 \, dF_X(x)$$

The role of estimation in statistics is to provide, from a set of realizations of a r.v., a good approximation for the parameters of interest.

What I wanted to show is that there is a big difference between the concept of a parameter (the variance, for this particular question) and the statistic we use to estimate it.

Why isn't the variance calculated this way?

So we want to estimate the variance of a random variable $X$ from a set of independent realizations of it, let's say $x = \{x_1, \ldots, x_n\}$. The way you propose doing it is by computing the absolute values of the successive differences, summing, and taking the mean:

$$\psi(x) = \frac{1}{n}\sum_{i=2}^{n} \left|x_i - x_{i-1}\right|$$

and the usual statistic is:

$$S^2(x) = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2,$$

where $\bar{x}$ is the sample mean.

When comparing two estimators of a parameter, the usual criterion is that the best one has minimal mean squared error (MSE), and an important property of the MSE is that it can be decomposed into two components:

MSE = squared bias of the estimator + variance of the estimator.

Using this criterion, the usual statistic $S^2$ has some advantages over the one you suggest.

  • First, it is an unbiased estimator of the variance, while your statistic is not unbiased.

  • Another important thing is that if we are working with the normal distribution, then $S^2$ is the best unbiased estimator of $\sigma^2$, in the sense that it has the smallest variance among all unbiased estimators, and thus minimizes the MSE.

When normality is assumed, as is the case in many applications, $S^2$ is the natural choice when you want to estimate the variance.
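A small simulation illustrates the comparison (a sketch under assumed settings: normal data with $\sigma^2 = 4$, NumPy, and an arbitrary seed; `psi` implements the successive-difference statistic defined above):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                 # true variance of the normal population
n, reps = 20, 100_000

def psi(x):
    # Mean absolute successive difference, as proposed in the question.
    return np.abs(np.diff(x)).mean()

s2_draws, psi_draws = [], []
for _ in range(reps):
    x = rng.normal(0.0, np.sqrt(sigma2), size=n)
    s2_draws.append(x.var(ddof=1))
    psi_draws.append(psi(x))

print(np.mean(s2_draws))   # ~4.00: S^2 is unbiased for sigma^2
print(np.mean(psi_draws))  # ~2.26: psi targets E|X - Y| = 2*sigma/sqrt(pi), not sigma^2
```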


3
Everything in this answer is well explained, correct, and interesting. However, introducing the "usual statistic" as an estimator confuses the issue, because the question is not about estimation, nor about bias, nor about the distinction between $1/n$ and $1/(n-1)$. That confusion might be at the root of your comments to several other answers in this thread.
whuber


1

Lots of good answers here, but I'll add a few.

  1. The way it is defined now has proven useful. For example, normal distributions appear all the time in data, and a normal distribution is defined by its mean and variance. Edit: as @whuber pointed out in a comment, there are various other ways to specify a normal distribution. But none of them, as far as I'm aware, deal with pairs of points in sequence.
  2. Variance as normally defined gives you a measure of how spread out the data are. For example, let's say you have a lot of data points with a mean of zero, but when you look at them you see that the data are mostly either around -1 or around 1. Your variance would be about 1. However, under your measure you would get a total of zero. Which one is more useful? Well, it depends, but it's not clear to me that a measure of zero for its "variance" would make sense. (A numerical sketch of this example follows at the end of this answer.)
  3. It lets you do other stuff. Just an example, in my stats class we saw a video about comparing pitchers (in baseball) over time. As I remember it, pitchers appeared to be getting worse since the proportion of pitches that were hit (or were home-runs) was going up. One reason is that batters were getting better. This made it hard to compare pitchers over time. However, they could use the z-score of the pitchers to compare them over time.

Nonetheless, as @Pere said, your metric might prove itself very useful in the future.
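To make point 2 above concrete, here is a quick sketch (assuming NumPy and an arbitrary seed): data clustered tightly around -1 and 1 has variance close to 1, while the sorted successive-difference measure is close to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
# Half the points near -1, half near +1, so the mean is about 0.
x = np.concatenate([rng.normal(-1, 0.01, 500), rng.normal(1, 0.01, 500)])

print(x.var())                           # ~1.0: the data are genuinely spread out
x_sorted = np.sort(x)
print(np.abs(np.diff(x_sorted)).mean())  # ~0.002: successive differences miss the spread
```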


1
A normal distribution can also be determined by its mean and fourth central moment, for that matter -- or by means of many other pairs of moments. The variance is not special in that way.
whuber

@whuber Interesting. I'll admit I didn't realize that. Nonetheless, unless I'm mistaken, all the moments are "variance like" in that they are based on distances from a certain point, as opposed to dealing with pairs of points in sequence. But I'll edit my answer to make note of what you said.
roundsquare

1
Could you explain the sense in which you mean "deal with pairs of points in sequence"? That's not a part of any standard definition of a moment. Note, too, that all the absolute moments around the mean--which includes all even moments around the mean--give a "measure of how spread out the data" are. One could, therefore, construct an analog of the Z-score with them. Thus, none of your three points appears to differentiate the variance from any absolute central moment.
whuber

@whuber yeah. The original question posited a 4 step sequence where you sort the points, take the differences between each point and the next point, and then average these. That's what I referred to as "deal[ing] with pairs of points in sequence". So you are right, none of the three points I gave distinguishes variance from any absolute central moment - they are meant to distinguish variance (and, I suppose, all absolute central moments) from the procedure described in the original question.
roundsquare
Licensed under cc by-sa 3.0 with attribution required.