Calculating the new standard deviation from the old standard deviation after a change in the dataset

I have an array of $n$ real values, which has mean $\mu_{old}$ and standard deviation $\sigma_{old}$. If an element of the array $x_i$ is replaced by another element $x_j$, then the new mean will be

$\mu_{new}=\mu_{old}+\frac{x_j-x_i}{n}$

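The advantage of this approach is that it requires a constant amount of computation regardless of the value of $n$. Is there any approach to calculate $\sigma_{new}$ from $\sigma_{old}$, like the computation of $\mu_{new}$ from $\mu_{old}$?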

standard-deviation online

— user

Is this homework? A very similar task was set in our mathematical statistics course.

— krlmlr

@user946850: No, it's not homework. I am conducting my thesis on Evolutionary Algorithm. I want to use standard deviation as a measure of population diversity. Just looking for more efficient solution.

— user

The SD is the square root of the variance, which is just the mean squared value (adjusted by a multiple of the squared mean, which you already know how to update). Therefore, the same methods used to compute a running mean can be applied without any fundamental change to compute a running variance. In fact, much more sophisticated statistics can be computed on an online basis using the same ideas: see the threads at stats.stackexchange.com/questions/6920 and stats.stackexchange.com/questions/23481, for example.

— whuber

@whuber: This is mentioned in the Wikipedia article for Variance, but also with a note on catastrophic cancellation (or loss of significance) that may occur. Is this overrated, or a real problem for the running variance?

— krlmlr

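Good question. If you accumulate the variance naively, without centering the data first, you can indeed get into trouble. The problem arises when the numbers are large but their variance is small. For example, consider a series of accurate measurements of the speed of light in m/s, such as 299792458.145, 299792457.883, 299792457.998, ... Their variance, around 0.01, is so small compared to their squares, around $10^{17}$, that a careless computation (even in double precision) would produce a variance of zero: all significant digits would disappear.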

— whuber
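
The cancellation described in the comment above can be reproduced with a short Python sketch. The function names `naive_variance` and `two_pass_variance` are made up for this illustration; the three measurements are the ones quoted in the comment, and `statistics.variance` from the standard library serves as the reference value.

```python
import statistics


def naive_variance(xs):
    """Sample variance from sum(x^2) and sum(x), with no centering."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


def two_pass_variance(xs):
    """Sample variance computed after subtracting the mean first."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


speeds = [299792458.145, 299792457.883, 299792457.998]

print("naive    :", naive_variance(speeds))
print("two-pass :", two_pass_variance(speeds))
print("reference:", statistics.variance(speeds))
```

The squares are of order $10^{17}$, where adjacent double-precision numbers are 16 or more apart, so the uncentered result cannot resolve a variance of about 0.017.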

Answers:

A section in the Wikipedia article on "Algorithms for calculating variance" shows how to compute the variance if elements are added to your observations. (Recall that the standard deviation is the square root of the variance.) Assume that you append $x_{n+1}$ to your array, then

$\sigma_{new}^2 = \sigma_{old}^2 + (x_{n+1} - \mu_{new})(x_{n+1} - \mu_{old}).$

EDIT: Above formula seems to be wrong, see comment.

Now, replacing an element means adding an observation and removing another one; both can be computed with the formula above. However, keep in mind that problems of numerical stability may ensue; the quoted article also proposes numerically stable variants.

To derive the formula by yourself, compute $(n-1)(\sigma_{new}^2 - \sigma_{old}^2)$ using the definition of sample variance and substitute $\mu_{new}$ by the formula you gave when appropriate. This gives you $\sigma_{new}^2 - \sigma_{old}^2$ in the end, and thus a formula for $\sigma_{new}$ given $\sigma_{old}$ and $\mu_{old}$ . In my notation, I assume you replace the element $x_n$ by $x_n'$ :

$\begin{aligned} \sigma^2 &= (n-1)^{-1} \sum_k (x_k - \mu)^2 \\ (n-1)(\sigma_{new}^2 - \sigma_{old}^2) &= \sum_{k=1}^{n-1} \left((x_k - \mu_{new})^2 - (x_k - \mu_{old})^2\right) \\ &\quad + \left((x_n' - \mu_{new})^2 - (x_n - \mu_{old})^2\right) \\ &= \sum_{k=1}^{n-1} \left((x_k - \mu_{old} - n^{-1}(x_n'-x_n))^2 - (x_k - \mu_{old})^2\right) \\ &\quad + \left((x_n' - \mu_{old} - n^{-1}(x_n'-x_n))^2 - (x_n - \mu_{old})^2\right) \end{aligned}$

The $x_k$ in the sum transform into something dependent on $\mu_{old}$, but you'll have to work the equation a little bit more to derive a neat result. This should give you the general idea.

— krlmlr
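
One possible way to finish the algebra from the expansion above, as a sketch rather than the answerer's own result: write $d = x_n' - x_n$ (a shorthand introduced here) and use $\sum_{k=1}^{n-1}(x_k - \mu_{old}) = -(x_n - \mu_{old})$, which holds because all $n$ deviations from $\mu_{old}$ sum to zero.

$\begin{aligned} \sum_{k=1}^{n-1} \left((x_k - \mu_{old} - n^{-1}d)^2 - (x_k - \mu_{old})^2\right) &= \frac{2d}{n}(x_n - \mu_{old}) + \frac{(n-1)d^2}{n^2} \\ (x_n' - \mu_{old} - n^{-1}d)^2 - (x_n - \mu_{old})^2 &= \frac{2(n-1)d}{n}(x_n - \mu_{old}) + \frac{(n-1)^2 d^2}{n^2} \\ (n-1)(\sigma_{new}^2 - \sigma_{old}^2) &= 2d(x_n - \mu_{old}) + \frac{(n-1)d^2}{n} \\ \sigma_{new}^2 &= \sigma_{old}^2 + \frac{2d(x_n - \mu_{old})}{n-1} + \frac{d^2}{n} \end{aligned}$

The second line uses $x_n' - \mu_{old} - n^{-1}d = (x_n - \mu_{old}) + \frac{n-1}{n}d$, and the third line is the sum of the first two.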

The first formula you gave does not seem correct. It implies that if $x_{n+1}$ is smaller or larger than both the new and the old mean, the variance always increases, which does not make sense. It may increase or decrease depending on the distribution.

— Emmet B

@EmmetB: Yes, you're right -- this should probably be $\sigma_{new}^2 = \frac{n-1}{n} \sigma_{old}^2 + \frac{1}{n} (x_{n+1} - \mu_{new})(x_{n+1} - \mu_{old}).$ Unfortunately, this renders void my whole discussion from there, but I'm leaving it for historic purposes. Feel free to edit, though.

— krlmlr
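
The corrected append formula from the comment above can be checked numerically with a small Python sketch. The helper name `append_update` and the sample data are invented for this illustration. The sketch assumes that $n$ is the number of values before the append and that $\sigma^2$ is the sample variance (divisor $n - 1$); the comment itself does not spell out either convention.

```python
import math
import statistics


def append_update(n, mean_old, var_old, x_new):
    """Return (n + 1, mean_new, var_new) after appending x_new.

    Assumes n >= 2 values so far and var_old = sample variance of them.
    """
    mean_new = mean_old + (x_new - mean_old) / (n + 1)
    var_new = (n - 1) / n * var_old \
        + (x_new - mean_new) * (x_new - mean_old) / n
    return n + 1, mean_new, var_new


data = [2.0, 4.0, 4.0, 5.0]
n, mean, var = len(data), statistics.mean(data), statistics.variance(data)

# Append one value in O(1) and compare with a full recomputation.
n, mean, var = append_update(n, mean, var, 9.0)
full = statistics.variance(data + [9.0])
print(var, full, math.isclose(var, full))
```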

Based on what I think I'm reading in the linked Wikipedia article, you can maintain a "running" standard deviation:

real sum = 0;
int count = 0;
real S = 0;   // running sum of squared deviations from the current mean

real GetRunningStandardDeviation(ref real sum, ref int count, ref real S, real x)
{
   if (count >= 1)
   {
       real oldMean = sum / count;
       sum = sum + x;
       count = count + 1;
       real newMean = sum / count;

       // Welford-style update of the sum of squared deviations
       S = S + (x-oldMean)*(x-newMean);
   }
   else
   {
       // first observation: no deviation to accumulate yet
       sum = x;
       count = 1;
       S = 0;
   }

   //estimated Variance = (S / (count-1) )
   //estimated Standard Deviation = sqrt(variance)
   if (count > 1)
      return sqrt(S / (count-1) );
   else
      return 0;
}

In the article they don't maintain a separate running sum and count, but instead keep a single running mean. Since in what I'm doing today I already keep a count (for statistical purposes), it is more useful to calculate the means each time.

— Ian Boyd
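
For readers who want something directly runnable, here is a rough Python transcription of the pseudocode above. The class name `RunningStd` and its method names are invented for this sketch; the update rule is the same one the pseudocode uses, and the comparison against `statistics.stdev` is only a sanity check on made-up data.

```python
import math
import statistics


class RunningStd:
    """Running sample standard deviation, keeping sum, count and S."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.s = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        """Add one observation and return the updated standard deviation."""
        if self.count >= 1:
            old_mean = self.total / self.count
            self.total += x
            self.count += 1
            new_mean = self.total / self.count
            self.s += (x - old_mean) * (x - new_mean)
        else:
            self.total = x
            self.count = 1
            self.s = 0.0
        return self.std()

    def std(self):
        """Sample standard deviation, or 0.0 with fewer than two values."""
        if self.count > 1:
            return math.sqrt(self.s / (self.count - 1))
        return 0.0


values = [2.0, 4.0, 4.0, 5.0, 9.0]
running = RunningStd()
for v in values:
    current = running.push(v)

print(current, statistics.stdev(values))
```

Because the sum is kept uncentered, this variant may still lose some precision in the mean for very large values; keeping a running mean instead, as the article does, is one alternative.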

Given original $\bar x$, $s$, and $n$, as well as the change of a given element $x_n$ to $x_n'$, I believe your new standard deviation $s'$ will be the square root of

$s^2 + \frac{1}{n-1}\left(2n\Delta \bar x(x_n-\bar x) +n(n-1)(\Delta \bar x)^2\right),$

where $\Delta \bar x = \bar x' - \bar x$, with $\bar x'$ denoting the new mean.

Maybe there is a snazzier way of writing it?

I checked this against a small test case and it seemed to work.

— Whistling in the Dark
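
A minimal Python sketch of the replacement formula above, checked against a full recomputation. The function name `replace_update` and the sample data are invented for this illustration; $s^2$ is taken to be the sample variance (divisor $n - 1$) and $n \ge 2$ is assumed.

```python
import math
import statistics


def replace_update(n, mean_old, var_old, x_old, x_new):
    """Return (mean_new, var_new) after x_old is replaced by x_new."""
    delta_mean = (x_new - x_old) / n  # change in the mean
    mean_new = mean_old + delta_mean
    var_new = var_old + (
        2 * n * delta_mean * (x_old - mean_old)
        + n * (n - 1) * delta_mean ** 2
    ) / (n - 1)
    # Rounding can push a near-zero variance slightly below zero.
    return mean_new, max(var_new, 0.0)


data = [2.0, 4.0, 4.0, 5.0, 9.0]
n = len(data)
mean, var = statistics.mean(data), statistics.variance(data)

# Replace data[1] (4.0) with 7.5 using the O(1) update.
mean_new, var_new = replace_update(n, mean, var, data[1], 7.5)

replaced = data.copy()
replaced[1] = 7.5
print(math.sqrt(var_new), statistics.stdev(replaced))
```

Over many successive replacements the rounding error of such an update can accumulate, so an occasional full recomputation may be worthwhile.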

@john / whistling in the Dark: I liked your answer; it seems to work properly on my small dataset. Is there any mathematical foundation or reference for it? Could you kindly help?

— Alok Chowdhury

The question was all @Whistling in the Dark, I just cleaned it up for the site. You should pose a new question referencing the question and answer here. And also you should upvote this answer if you feel that way.

— John