Put simply, ANOVA adds, squares, and averages residuals. Residuals tell you how well the model fits the data. For this example I used the PlantGrowth dataset in R, which contains the results of an experiment comparing the yields (measured by the dried weight of plants) obtained under a control and two different treatment conditions.
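For reference, the dataset ships with base R, so you can take a quick look at it yourself (a minimal sketch, nothing more):

    # PlantGrowth: 30 plants, 10 per group (ctrl, trt1, trt2)
    data(PlantGrowth)
    str(PlantGrowth)
    tapply(PlantGrowth$weight, PlantGrowth$group, mean)  # group means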
The first plot shows the grand mean across the three treatment levels. The red lines are the residuals. Now, if you square the lengths of these individual lines and add them up, you get a value that tells you how well the mean (the model) describes the data. A small number means the mean describes the data points well; a large number means it describes them poorly. This number is called the Total Sum of Squares:
$SS_{total} = \sum (x_i - \bar{x}_{grand})^2$, where $x_i$ are the individual data points and $\bar{x}_{grand}$ is the grand mean.
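If you want to verify this yourself, the total sum of squares is only a few lines in R (an illustrative sketch; the variable names are mine):

    x <- PlantGrowth$weight
    grand_mean <- mean(x)                 # the grand mean
    ss_total <- sum((x - grand_mean)^2)   # squared residuals, added up
    ss_total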
Now we do the same for the treatment residuals (the residual sums of squares, also known as the noise within the treatment levels). And the formula:
$SS_{residuals} = \sum (x_{ik} - \bar{x}_k)^2$, where $x_{ik}$ are the individual data points $i$ in each of the $k$ levels and $\bar{x}_k$ is the mean within each treatment level.
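Again as a sketch, the same quantity in R, using ave() to attach the group mean to every observation:

    x <- PlantGrowth$weight; g <- PlantGrowth$group
    group_mean <- ave(x, g)                  # mean of each level, repeated per row
    ss_residual <- sum((x - group_mean)^2)   # noise within the treatment levels
    ss_residual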
Lastly, we need to determine the signal in the data, which is known as the Model Sums of Squares, which will later be used to calculate whether the treatment means are any different from the grand mean:
And the formula:
$SS_{model} = \sum n_k (\bar{x}_k - \bar{x}_{grand})^2$, where $n_k$ is the sample size $n$ in each of your $k$ levels, and $\bar{x}_k$ and $\bar{x}_{grand}$ are the means within and across the treatment levels, respectively.
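And the corresponding sketch in R; as a sanity check, $SS_{model} + SS_{residuals}$ should add up to $SS_{total}$:

    x <- PlantGrowth$weight; g <- PlantGrowth$group
    n_k    <- tapply(x, g, length)   # sample size per level
    mean_k <- tapply(x, g, mean)     # mean per level
    ss_model <- sum(n_k * (mean_k - mean(x))^2)
    ss_model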
Now the disadvantage of sums of squares is that they get bigger as the sample size increases. To express those sums of squares relative to the number of observations in the data set, you divide them by their degrees of freedom, turning them into variances. So after squaring and adding your data points, you are now averaging them using their degrees of freedom:
$df_{total} = n - 1$
$df_{residual} = n - k$
$df_{model} = k - 1$
where $n$ is the total number of observations and $k$ the number of treatment levels.
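For PlantGrowth this works out as follows (illustrative):

    n <- nrow(PlantGrowth)           # 30 observations
    k <- nlevels(PlantGrowth$group)  # 3 treatment levels
    c(df_total = n - 1, df_residual = n - k, df_model = k - 1)  # 29, 27, 2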
This results in the Model Mean Square and the Residual Mean Square (both are variances); their ratio, the signal-to-noise ratio, is known as the F-value:
$MS_{model} = \frac{SS_{model}}{df_{model}}$
$MS_{residual} = \frac{SS_{residual}}{df_{residual}}$
$F = \frac{MS_{model}}{MS_{residual}}$
The F-value describes the signal-to-noise ratio, i.e., whether the treatment means differ from the grand mean. The F-value is then used to calculate the p-value, which decides whether at least one of the treatment means is significantly different from the grand mean or not.
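Putting the pieces together in R and comparing against the built-in ANOVA (a self-contained sketch; the variable names are mine):

    x <- PlantGrowth$weight; g <- PlantGrowth$group
    ss_model    <- sum(tapply(x, g, length) * (tapply(x, g, mean) - mean(x))^2)
    ss_residual <- sum((x - ave(x, g))^2)
    ms_model    <- ss_model / (nlevels(g) - 1)
    ms_residual <- ss_residual / (length(x) - nlevels(g))
    f_value <- ms_model / ms_residual
    p_value <- pf(f_value, nlevels(g) - 1, length(x) - nlevels(g), lower.tail = FALSE)
    c(F = f_value, p = p_value)                      # should match the aov() output
    summary(aov(weight ~ group, data = PlantGrowth)) # the built-in ANOVA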
Now I hope you can see that the assumptions are based on calculations with residuals and why they are important. Since we are adding, squaring, and averaging residuals, we should make sure that, before we do this, the data in those treatment groups behave similarly; otherwise the F-value may be biased to some degree, and inferences drawn from it may not be valid.
Edit: I added two paragraphs to address the OP's questions 2 and 1 more specifically.
Normality assumption:
The mean (or expected value) is often used in statistics to describe the center of a distribution; however, it is not very robust and is easily influenced by outliers. The mean is the simplest model we can fit to the data. Since in ANOVA we use the mean to calculate the residuals and the sums of squares (see the formulae above), the data should be roughly normally distributed (the normality assumption). If this is not the case, the mean may not be an appropriate model for the data, since it would not give us a correct location of the center of the sample distribution. Instead one could use the median, for example (see non-parametric testing procedures).
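One way to check this assumption in R is to look at the residuals of the fitted model (just a sketch of the usual diagnostics):

    fit <- aov(weight ~ group, data = PlantGrowth)
    qqnorm(residuals(fit)); qqline(residuals(fit))  # points should roughly follow the line
    shapiro.test(residuals(fit))                    # formal test; treat as a guide only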
Homogeneity of variance assumption:
Later, when we calculate the mean squares (model and residual), we pool the individual sums of squares from the treatment levels and average them (see the formulae above). By pooling and averaging we lose the information about the individual treatment-level variances and their contribution to the mean squares. Therefore, we should have roughly the same variance among all treatment levels so that their contributions to the mean squares are similar. If the variances between the treatment levels were different, the resulting mean squares and F-value would be biased, which would influence the calculation of the p-values and make inferences drawn from those p-values questionable (see also @whuber 's comment and @Glen_b 's answer).
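Similarly, a quick sketch for checking the variances in R:

    tapply(PlantGrowth$weight, PlantGrowth$group, var)  # spread within each level
    bartlett.test(weight ~ group, data = PlantGrowth)   # test for equal variances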
This is how I see it for myself. It may not be 100% accurate (I am not a statistician), but it helps me understand why satisfying the assumptions of ANOVA is important.