I am a noob in statistics, so could you guys please help me out here?
My question is the following: what does pooled variance actually mean?
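When I look for a formula for pooled variance on the internet, I find a lot of literature using the following formula (for example, here: http://math.tntech.edu/ISR/Mathematical_Statistics/Introduction_to_Statistical_Tests/thispage/newnode19.html ):
$$S_p^2 = \frac{S_1^2 (n_1 - 1) + S_2^2 (n_2 - 1)}{n_1 + n_2 - 2}$$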
But what does it actually calculate? When I use this formula to calculate my pooled variance, it gives me the wrong answer.
For example, consider this "parent sample":
2,2,2,2,2,8,8,8,8,8
The variance of this parent sample is $S_p^2 = 10$, and its mean is $\bar{x}_p = 5$.
Now, suppose I split this parent sample into two sub-samples:
- The first sub-sample is 2,2,2,2,2 with mean $\bar{x}_1 = 2$ and variance $S_1^2 = 0$.
- The second sub-sample is 8,8,8,8,8 with mean $\bar{x}_2 = 8$ and variance $S_2^2 = 0$.
Now, clearly, using the above formula to calculate the pooled/parent variance of these two sub-samples will produce zero, because $S_1^2 = 0$ and $S_2^2 = 0$. So what does this formula actually calculate?
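To make the mismatch concrete, here is a minimal Python sketch of the comparison. It assumes the "literature's formula" is the usual pooled variance, $S_p^2 = \frac{S_1^2 (n_1 - 1) + S_2^2 (n_2 - 1)}{n_1 + n_2 - 2}$. The function name `pooled_variance` is just my own label, and `statistics.variance` is the standard-library sample variance (divisor $n - 1$).

```python
import statistics

def pooled_variance(s1_sq, n1, s2_sq, n2):
    """Assumed literature formula: weighted average of the two sample variances."""
    return (s1_sq * (n1 - 1) + s2_sq * (n2 - 1)) / (n1 + n2 - 2)

sub1 = [2, 2, 2, 2, 2]
sub2 = [8, 8, 8, 8, 8]
parent = sub1 + sub2  # the "parent sample" is just both sub-samples together

pooled = pooled_variance(statistics.variance(sub1), len(sub1),
                         statistics.variance(sub2), len(sub2))
parent_var = statistics.variance(parent)

print(f"pooled variance (literature formula): {pooled:.1f}")
print(f"variance of the parent sample:        {parent_var:.1f}")
```

The first number comes out as zero and the second as 10, which is exactly the disagreement I am asking about.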
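On the other hand, after some lengthy derivation, I found that the formula which produces the correct pooled/parent variance is:
$$S_p^2 = \frac{S_1^2 (n_1 - 1) + n_1 d_1^2 + S_2^2 (n_2 - 1) + n_2 d_2^2}{n_1 + n_2 - 1}$$
In the above formula, $d_1 = \bar{x}_1 - \bar{x}_p$ and $d_2 = \bar{x}_2 - \bar{x}_p$.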
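As a sanity check, here is a small Python sketch of that formula. It assumes the formula is the combined-variance expression with numerator $S_1^2 (n_1 - 1) + n_1 d_1^2 + S_2^2 (n_2 - 1) + n_2 d_2^2$ and denominator $n_1 + n_2 - 1$, where $d_i = \bar{x}_i - \bar{x}_p$ is the offset of each sub-sample mean from the parent mean. The name `combined_variance` is hypothetical, chosen only for this illustration.

```python
import statistics

def combined_variance(sub1, sub2):
    """Variance of the merged sample, built only from sub-sample summaries."""
    n1, n2 = len(sub1), len(sub2)
    m1, m2 = statistics.mean(sub1), statistics.mean(sub2)
    v1, v2 = statistics.variance(sub1), statistics.variance(sub2)

    # Parent mean is the size-weighted average of the sub-sample means.
    parent_mean = (n1 * m1 + n2 * m2) / (n1 + n2)
    d1, d2 = m1 - parent_mean, m2 - parent_mean

    # Within-group terms plus the extra terms for the shift of each mean.
    numerator = v1 * (n1 - 1) + n1 * d1**2 + v2 * (n2 - 1) + n2 * d2**2
    return numerator / (n1 + n2 - 1)

sub1 = [2, 2, 2, 2, 2]
sub2 = [8, 8, 8, 8, 8]

print(combined_variance(sub1, sub2))      # 10.0
print(statistics.variance(sub1 + sub2))   # direct computation on the parent sample
```

Both lines agree on a variance of 10 for this example, so the sketch at least reproduces the parent variance I computed by hand.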
I found formulas similar to mine, for example here: http://www.emathzone.com/tutorials/basic-statistics/combined-variance.html and also on Wikipedia, although I have to admit that they don't look exactly the same as mine.
So again, what does pooled variance actually mean? Shouldn't it mean the variance of the parent sample formed from the two sub-samples? Or am I completely wrong here?
Thank you in advance.
EDIT 1: Someone said that my two sub-samples above are pathological since they have zero variance. Well, I can give you a different example. Consider this parent sample:
1,2,3,4,5,46,47,48,49,50
The variance of this parent sample is $S_p^2 = 564.7$, and its mean is $\bar{x}_p = 25.5$.
Now, suppose I split this parent sample into two sub-samples:
- The first sub-sample is 1,2,3,4,5 with mean $\bar{x}_1 = 3$ and variance $S_1^2 = 2.5$.
- The second sub-sample is 46,47,48,49,50 with mean $\bar{x}_2 = 48$ and variance $S_2^2 = 2.5$.
Now, if you use the "literature's formula" to compute the pooled variance, you will get 2.5, which is completely wrong, because the parent/pooled variance should be 564.7. If you use "my formula" instead, you will get the correct answer.
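Spelling out the arithmetic for this second example, with $n_1 = n_2 = 5$ and $S_1^2 = S_2^2 = 2.5$. The second line assumes "my formula" is the combined-variance expression with the extra terms $n_i d_i^2$, where $d_i = \bar{x}_i - \bar{x}_p$, so here $d_1 = 3 - 25.5 = -22.5$ and $d_2 = 48 - 25.5 = 22.5$:

$$
\begin{aligned}
\text{literature's formula:}\quad & \frac{2.5\,(5-1) + 2.5\,(5-1)}{5 + 5 - 2} = \frac{20}{8} = 2.5 \\
\text{my formula:}\quad & \frac{2.5\,(5-1) + 5\,(-22.5)^2 + 2.5\,(5-1) + 5\,(22.5)^2}{5 + 5 - 1} = \frac{20 + 5062.5}{9} = \frac{5082.5}{9} \approx 564.7
\end{aligned}
$$

So nearly all of the 564.7 comes from the two $n_i d_i^2$ terms, that is, from the distance between the sub-sample means and the parent mean, which the literature's formula does not include at all.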
Please understand, I use extreme examples here to show people that the formula is indeed wrong. If I used "normal data" without such extreme variation, the results from the two formulae would be very similar, and people could dismiss the difference as rounding error rather than as a problem with the formula itself.