Put simply, ANOVA adds, squares, and averages residuals. Residuals tell you how well the model fits the data. For this example I used the PlantGrowth dataset in R, which contains the results of an experiment comparing the yields (measured by the dried weight of plants) obtained under a control and two different treatment conditions.
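For reference, the dataset ships with base R, so you can take a quick look at it yourself (a minimal sketch, nothing more):

    # PlantGrowth: 30 plants, 10 per group (ctrl, trt1, trt2)
    data(PlantGrowth)
    str(PlantGrowth)
    tapply(PlantGrowth$weight, PlantGrowth$group, mean)  # group means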
The first plot shows the grand mean across the three treatment levels. The red lines are the residuals. Now, if you square the lengths of these individual lines and add them up, you get a value that tells you how well the mean (the model) describes the data. A small number means the mean describes the data points well; a large number means it describes them poorly. This number is called the Total Sum of Squares:
$SS_{total} = \sum (x_i - \bar{x}_{grand})^2$, where $x_i$ are the individual data points and $\bar{x}_{grand}$ is the grand mean.
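If you want to verify this yourself, the total sum of squares is only a few lines in R (an illustrative sketch; the variable names are mine):

    x <- PlantGrowth$weight
    grand_mean <- mean(x)                 # the grand mean
    ss_total <- sum((x - grand_mean)^2)   # squared residuals, added up
    ss_total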
Now we do the same for the treatment residuals (the residual sums of squares, also known as the noise within the treatment levels). And the formula:
$SS_{residuals} = \sum (x_{ik} - \bar{x}_k)^2$, where $x_{ik}$ are the individual data points $i$ in each of the $k$ levels and $\bar{x}_k$ is the mean within each treatment level.
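Again as a sketch, the same quantity in R, using ave() to attach the group mean to every observation:

    x <- PlantGrowth$weight; g <- PlantGrowth$group
    group_mean <- ave(x, g)                  # mean of each level, repeated per row
    ss_residual <- sum((x - group_mean)^2)   # noise within the treatment levels
    ss_residual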
Lastly, we need to determine the signal in the data, which is known as the Model Sums of Squares, which will later be used to calculate whether the treatment means are any different from the grand mean:
And the formula:
$SS_{model} = \sum n_k (\bar{x}_k - \bar{x}_{grand})^2$, where $n_k$ is the sample size $n$ in each of your $k$ levels, and $\bar{x}_k$ and $\bar{x}_{grand}$ are the means within and across the treatment levels, respectively.
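And the corresponding sketch in R; as a sanity check, $SS_{model} + SS_{residuals}$ should add up to $SS_{total}$:

    x <- PlantGrowth$weight; g <- PlantGrowth$group
    n_k    <- tapply(x, g, length)   # sample size per level
    mean_k <- tapply(x, g, mean)     # mean per level
    ss_model <- sum(n_k * (mean_k - mean(x))^2)
    ss_model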
Now the disadvantage of sums of squares is that they get bigger as the sample size increases. To express those sums of squares relative to the number of observations in the data set, you divide them by their degrees of freedom, turning them into variances. So after squaring and adding your data points, you are now averaging them using their degrees of freedom:
$df_{total} = n - 1$
$df_{residual} = n - k$
$df_{model} = k - 1$
where $n$ is the total number of observations and $k$ the number of treatment levels.
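For PlantGrowth this works out as follows (illustrative):

    n <- nrow(PlantGrowth)           # 30 observations
    k <- nlevels(PlantGrowth$group)  # 3 treatment levels
    c(df_total = n - 1, df_residual = n - k, df_model = k - 1)  # 29, 27, 2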
This results in the Model Mean Square and the Residual Mean Square (both are variances); their ratio, the signal-to-noise ratio, is known as the F-value:
$MS_{model} = \frac{SS_{model}}{df_{model}}$
$MS_{residual} = \frac{SS_{residual}}{df_{residual}}$
$F = \frac{MS_{model}}{MS_{residual}}$
The F-value describes the signal-to-noise ratio, i.e., whether the treatment means differ from the grand mean. The F-value is then used to calculate the p-value, which decides whether at least one of the treatment means is significantly different from the grand mean or not.
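Putting the pieces together in R and comparing against the built-in ANOVA (a self-contained sketch; the variable names are mine):

    x <- PlantGrowth$weight; g <- PlantGrowth$group
    ss_model    <- sum(tapply(x, g, length) * (tapply(x, g, mean) - mean(x))^2)
    ss_residual <- sum((x - ave(x, g))^2)
    ms_model    <- ss_model / (nlevels(g) - 1)
    ms_residual <- ss_residual / (length(x) - nlevels(g))
    f_value <- ms_model / ms_residual
    p_value <- pf(f_value, nlevels(g) - 1, length(x) - nlevels(g), lower.tail = FALSE)
    c(F = f_value, p = p_value)                      # should match the aov() output
    summary(aov(weight ~ group, data = PlantGrowth)) # the built-in ANOVA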
Now I hope you can see that the assumptions are based on calculations with residuals and why they are important. Since we are adding, squaring, and averaging residuals, we should make sure that, before we do this, the data in those treatment groups behave similarly; otherwise the F-value may be biased to some degree, and inferences drawn from it may not be valid.
Edit: I added two paragraphs to address the OP's questions 2 and 1 more specifically.
Normality assumption:
The mean (or expected value) is often used in statistics to describe the center of a distribution; however, it is not very robust and is easily influenced by outliers. The mean is the simplest model we can fit to the data. Since in ANOVA we use the mean to calculate the residuals and the sums of squares (see the formulae above), the data should be roughly normally distributed (the normality assumption). If this is not the case, the mean may not be an appropriate model for the data, since it would not give us a correct location of the center of the sample distribution. Instead one could use the median, for example (see non-parametric testing procedures).
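One way to check this assumption in R is to look at the residuals of the fitted model (just a sketch of the usual diagnostics):

    fit <- aov(weight ~ group, data = PlantGrowth)
    qqnorm(residuals(fit)); qqline(residuals(fit))  # points should roughly follow the line
    shapiro.test(residuals(fit))                    # formal test; treat as a guide only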
Homogeneity of variance assumption:
Later, when we calculate the mean squares (model and residual), we pool the individual sums of squares from the treatment levels and average them (see the formulae above). By pooling and averaging we lose the information about the individual treatment-level variances and their contribution to the mean squares. Therefore, we should have roughly the same variance among all treatment levels so that their contributions to the mean squares are similar. If the variances between the treatment levels were different, the resulting mean squares and F-value would be biased, which would influence the calculation of the p-values and make inferences drawn from those p-values questionable (see also @whuber 's comment and @Glen_b 's answer).
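Similarly, a quick sketch for checking the variances in R:

    tapply(PlantGrowth$weight, PlantGrowth$group, var)  # spread within each level
    bartlett.test(weight ~ group, data = PlantGrowth)   # test for equal variances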
This is how I see it for myself. It may not be 100% accurate (I am not a statistician), but it helps me understand why satisfying the assumptions of ANOVA is important.