데이터 세트 변경 후 기존 표준 편차를 사용하여 새로운 표준 편차 계산


16

I have an array of n real values, which has mean μold and standard deviation σold. If an element of the array xi is replaced by another element xj, then new mean will be

μnew=μold+xjxin

이 방법의 장점은 값에 관계없이 일정한 계산이 필요하다는 것 입니다. 계산에 대한 접근도는 σ N E w 사용 σ O L D를 의 계산과 같은 μ n은 전자 w 사용 μ O L D는 ?nσnewσoldμnewμold


숙제입니까? 우리의 수학적 통계 과정에서 매우 비슷한 과제가 요청되었습니다.
krlmlr

2
@user946850: No, it's not homework. I am conducting my thesis on Evolutionary Algorithm. I want to use standard deviation as a measure of population diversity. Just looking for more efficient solution.
user

1
The SD is the square root of the variance, which is just the mean squared value (adjusted by a multiple of the squared mean, which you already know how to update). Therefore, the same methods used to compute a running mean can be applied without any fundamental change to compute a running variance. In fact, much more sophisticated statistics can be computed on an online basis using the same ideas: see the threads at stats.stackexchange.com/questions/6920 and stats.stackexchange.com/questions/23481, for example.
whuber

1
@whuber: This is mentioned in the Wikipedia article for Variance, but also with a note on catastrophic cancellation (or loss of significance) that may occur. Is this overrated, or a real problem for the running variance?
krlmlr

좋은 질문입니다. 미리 중심을 맞추지 않고 순진하게 분산을 누적하면 실제로 문제가 발생할 수 있습니다. 숫자는 크지 만 분산이 작을 때 문제가 발생합니다. 예는 299792458.145, 299792457.883, 299792457.998, ...와 같이, S / m에 빛의 속도를 정확히 측정하는 일련 고려해 약 0.01 그들의 분산, 주위 그 사각형에 비해 너무 작아서 , 부주의 한 계산 (배정 밀도에서도)은 분산이 0이되며 모든 유효 숫자는 사라질 것이다. 1017
whuber

답변:


7

"분산 계산하기위한 알고리즘"에 대한 위키 백과의 문서 섹션 방법 요소가 귀하의 관찰에 추가하는 경우 분산을 계산하는 방법을 보여줍니다. (표준 편차는 분산의 제곱근입니다.) x n + 1을 더 한다고 가정합니다.xn+1 을 배열에 추가 한 다음

σnew2=σold2+(xn+1μnew)(xn+1μold).

EDIT: Above formula seems to be wrong, see comment.

Now, replacing an element means adding an observation and removing another one; both can be computed with the formula above. However, keep in mind that problems of numerical stability may ensue; the quoted article also proposes numerically stable variants.

To derive the formula by yourself, compute (n1)(σnew2σold2) using the definition of sample variance and substitute μnew by the formula you gave when appropriate. This gives you σnew2σold2 in the end, and thus a formula for σnew given σold and μold. In my notation, I assume you replace the element xn by xn:

σ2=(n1)1k(xkμ)2(n1)(σnew2σold2)=k=1n1((xkμnew)2(xkμold)2)+ ((xnμnew)2(xnμold)2)=k=1n1((xkμoldn1(xnxn))2(xkμold)2)+ ((xnμoldn1(xnxn))2(xnμold)2)

The xk in the sum transform into something dependent of μold, but you'll have to work the equation a little bit more to derive a neat result. This should give you the general idea.


the first formula you gave does not seem correct, well it means that if the xn+1 is smaller/larger then from both new and old mean, the variance always increases, which does not make any sense. It may increase or decrease depending on the distribution.
Emmet B

@EmmetB: Yes, you're right -- this should probably be σnew2=n1nσold2+1n(xn+1μnew)(xn+1μold). Unfortunately, this renders void my whole discussion from there, but I'm leaving it for historic purposes. Feel free to edit, though.
krlmlr

4

Based on what i think i'm reading on the linked Wikipedia article you can maintain a "running" standard deviation:

real sum = 0;
int count = 0;
real S = 0;
real variance = 0;

real GetRunningStandardDeviation(ref sum, ref count, ref S, x)
{
   real oldMean;

   if (count >= 1)
   {
       real oldMean = sum / count;
       sum = sum + x;
       count = count + 1;
       real newMean = sum / count;

       S = S + (x-oldMean)*(x-newMean)
   }
   else
   {
       sum = x;
       count = 1;
       S = 0;         
   }

   //estimated Variance = (S / (k-1) )
   //estimated Standard Deviation = sqrt(variance)
   if (count > 1)
      return sqrt(S / (count-1) );
   else
      return 0;
}

Although in the article they don't maintain a separate running sum and count, but instead have the single mean. Since in thing i'm doing today i keep a count (for statistical purposes), it is more useful to calculate the means each time.


0

Given original x¯, s, and n, as well as the change of a given element xn to xn, I believe your new standard deviation s will be the square root of

s2+1n1(2nΔx¯(xnx¯)+n(n1)(Δx¯)2),
where Δx¯=x¯x¯, with x¯ denoting the new mean.

Maybe there is a snazzier way of writing it?

I checked this against a small test case and it seemed to work.


1
@john / whistling in the Dark: I liked your answer, it seems work properly in my small dataset. Is there any mathematical foundation/reference on it? Could you kindly help?
Alok Chowdhury

The question was all @Whistling in the Dark, I just cleaned it up for the site. You should pose a new question referencing the question and answer here. And also you should upvote this answer if you feel that way.
John
당사 사이트를 사용함과 동시에 당사의 쿠키 정책개인정보 보호정책을 읽고 이해하였음을 인정하는 것으로 간주합니다.
Licensed under cc by-sa 3.0 with attribution required.