Answer:

Start from Pearson's correlation coefficient applied to the ranks,

$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}.$$

Because there are no ties, both $x$ and $y$ consist of the integers from $1$ to $n$. Therefore we can rewrite the denominator:

$$\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2} = \sum_{k=1}^n \left(k - \frac{n+1}{2}\right)^2 = \frac{n(n^2-1)}{12}.$$

So the denominator is just a function of $n$. Now let's look at the numerator. Writing $d_i = x_i - y_i$ and noting that $\bar{x} = \bar{y}$, so that $d_i = (x_i - \bar{x}) - (y_i - \bar{y})$, the identity $ab = \frac{1}{2}\left(a^2 + b^2 - (a-b)^2\right)$ gives

$$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{2}\left(\sum_{i=1}^n (x_i - \bar{x})^2 + \sum_{i=1}^n (y_i - \bar{y})^2 - \sum_{i=1}^n d_i^2\right) = \frac{n(n^2-1)}{12} - \frac{1}{2}\sum_{i=1}^n d_i^2.$$

Then

$$\frac{\text{Numerator}}{\text{Denominator}} = \frac{\frac{n(n^2-1)}{12} - \frac{1}{2}\sum_i d_i^2}{\frac{n(n^2-1)}{12}} = 1 - \frac{6\sum_i d_i^2}{n(n^2-1)}.$$

Hence Pearson's $r$ computed on the ranks is exactly Spearman's

$$\rho = 1 - \frac{6\sum_{i=1}^n d_i^2}{n(n^2-1)}.$$
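(A quick numerical sketch of that step, not part of the original argument: it assumes NumPy, and the variable names are my own.)

```
# Check: sum((x - xbar)(y - ybar)) = n(n^2-1)/12 - (1/2)*sum(d^2) for tie-free ranks,
# and that numerator/denominator reproduces the Spearman expression.
import numpy as np

rng = np.random.default_rng(3)
n = 8
x = np.arange(1, n + 1)        # ranks 1..n
y = rng.permutation(x)         # a second ranking with no ties

numerator = np.sum((x - x.mean()) * (y - y.mean()))
d2 = np.sum((x - y) ** 2)                  # sum of squared rank differences
denominator = n * (n ** 2 - 1) / 12        # = sum((x - xbar)^2) for ranks 1..n

print(np.isclose(numerator, denominator - d2 / 2))
print(np.isclose(numerator / denominator, 1 - 6 * d2 / (n * (n ** 2 - 1))))
```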
We see that in the second formula there appears the squared Euclidean distance between the two (ranked) variables: $d^2 = \sum_i (x_i - y_i)^2$. The decisive intuition at the start will be how $d^2$ might be related to $r$. It is clearly related via the cosine theorem. If we have the two variables centered, then the cosine in the linked theorem's formula is equal to $r$ (it can be easily proved; we'll take it here as granted). And $h^2$ (the squared Euclidean norm) is $n\sigma^2$, the sum-of-squares in a centered variable. So the theorem's formula looks like this: $d^2 = n\sigma_x^2 + n\sigma_y^2 - 2n\sigma_x\sigma_y r$. Please note also another important thing (which might have to be proved separately): when data are ranks, $d^2$ is the same for centered and not centered data.

Further, since the two variables were ranked, their variances are the same, $\sigma_x^2 = \sigma_y^2 = \sigma^2$, so $d^2 = 2n\sigma^2(1 - r)$.

Rearranging, $r = 1 - \frac{d^2}{2n\sigma^2}$. Recall that ranked data come from a discrete uniform distribution having variance $\sigma^2 = \frac{n^2-1}{12}$. Substituting it into the formula leaves

$$r = 1 - \frac{6\, d^2}{n(n^2-1)} = 1 - \frac{6\sum_i (x_i - y_i)^2}{n(n^2-1)}.$$
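As a sketch rather than part of the argument above, the cosine-theorem relation $d^2 = 2n\sigma^2(1 - r)$ is easy to confirm numerically (assuming NumPy and SciPy are available):

```
# Check: for tie-free ranks, d^2 = 2*n*sigma^2*(1 - r), and hence
# r = 1 - 6*d^2 / (n*(n^2 - 1)), where r is Pearson's r on the ranks.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 10
x = np.arange(1, n + 1)            # ranks 1..n
y = rng.permutation(x)             # another tie-free ranking

r = pearsonr(x, y)[0]              # Pearson correlation of the two rankings
d2 = np.sum((x - y) ** 2)          # squared Euclidean distance between them
sigma2 = (n ** 2 - 1) / 12         # variance of a discrete uniform on 1..n

print(np.isclose(d2, 2 * n * sigma2 * (1 - r)))
print(np.isclose(r, 1 - 6 * d2 / (n * (n ** 2 - 1))))
```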
The algebra is simpler than it might first appear.
IMHO, there is little profit or insight achieved by belaboring the algebraic manipulations. Instead, a truly simple identity shows why squared differences can be used to express the (usual Pearson) correlation coefficient. Applying this to the special case where the data are ranks produces the result. It exhibits the heretofore mysterious coefficient

$$\frac{6}{n(n^2-1)}$$

as being half the reciprocal of the variance of the ranks $1, 2, \ldots, n$, once the sum of squared rank differences is written as $n$ times the mean squared difference. (When ties are present, this coefficient acquires a more complicated formula, but it will still be one-half the reciprocal of the variance of the ranks assigned to the data.)
Once you have seen and understood this, the formula becomes memorable. Comparable (but more complex) formulas that handle ties, that show up in nonparametric statistical tests like the Wilcoxon rank sum test, or that appear in spatial statistics (like Moran's I, Geary's C, and others) become instantly understandable.
Consider any set of paired data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, with means $\bar{x}$ and $\bar{y}$ and variances $s_x^2$ and $s_y^2$. By recentering the variables at their means $\bar{x}$ and $\bar{y}$ and using their standard deviations $s_x$ and $s_y$ as units of measurement, the data will be re-expressed in terms of the standardized values

$$(\xi_i, \eta_i) = \left(\frac{x_i - \bar{x}}{s_x},\ \frac{y_i - \bar{y}}{s_y}\right).$$

By definition, the Pearson correlation coefficient of the original data is the average product of the standardized values,

$$\rho = \frac{1}{n}\sum_{i=1}^n \xi_i \eta_i.$$

The Polarization Identity relates products to squares. For two numbers $\xi$ and $\eta$ it asserts

$$\xi\eta = \frac{1}{2}\left(\xi^2 + \eta^2 - (\xi - \eta)^2\right),$$

which is easily verified. Applying this to each term in the sum gives

$$\rho = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left(\xi_i^2 + \eta_i^2 - (\xi_i - \eta_i)^2\right).$$

Because the $\xi_i$ and $\eta_i$ have been standardized, their average squares are both unity, whence

$$\rho = 1 - \frac{1}{2}\left(\frac{1}{n}\sum_{i=1}^n (\xi_i - \eta_i)^2\right).$$

The correlation coefficient differs from its maximum possible value, $1$, by one-half the mean squared difference of the standardized data.
This is a universal formula for correlation, valid no matter what the original data were (provided only that both variables have nonzero standard deviations). (Faithful readers of this site will recognize this as being closely related to the geometric characterization of covariance described and illustrated at How would you explain covariance to someone who understands only the mean?.)
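Here is a small illustration of this universal form (my own sketch, assuming NumPy and SciPy; the simulated data are arbitrary): standardize any paired data and compare $1 - \frac{1}{2}\overline{(\xi - \eta)^2}$ with the usual Pearson coefficient.

```
# Check: rho = 1 - (1/2) * mean((xi - eta)^2) equals Pearson's r for arbitrary data.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.7 * x + rng.normal(size=50)      # any paired data with nonzero spread

xi = (x - x.mean()) / x.std()          # standardized values (population SD)
eta = (y - y.mean()) / y.std()

rho = 1 - 0.5 * np.mean((xi - eta) ** 2)
print(np.isclose(rho, pearsonr(x, y)[0]))
```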
In the special case where the $x_i$ and $y_i$ are distinct ranks, each is a permutation of the same sequence of numbers $1, 2, \ldots, n$. Thus $\bar{x} = \bar{y} = \frac{n+1}{2}$ and, with a tiny bit of calculation, we find

$$s_x^2 = s_y^2 = \frac{1}{n}\sum_{i=1}^n \left(i - \frac{n+1}{2}\right)^2 = \frac{n^2-1}{12}$$

(which, happily, is nonzero whenever $n > 1$). Therefore

$$\xi_i - \eta_i = \frac{x_i - \bar{x}}{s_x} - \frac{y_i - \bar{y}}{s_y} = \frac{x_i - y_i}{s_x}.$$

This nice simplification occurred because the $x_i$ and $y_i$ have the same means and standard deviations: the difference of their means therefore disappeared and the product $s_x s_y$ became $s_x^2$, which involves no square roots.

Plugging this into the formula for $\rho$ gives

$$\rho = 1 - \frac{1}{2}\left(\frac{1}{n}\sum_{i=1}^n \left(\frac{x_i - y_i}{s_x}\right)^2\right) = 1 - \frac{\sum_{i=1}^n (x_i - y_i)^2}{2n s_x^2} = 1 - \frac{6\sum_{i=1}^n (x_i - y_i)^2}{n(n^2-1)}.$$
High school students may see the PMCC and Spearman correlation formulae years before they have the algebra skills to manipulate sigma notation, though they may well know the method of finite differences for deducing the polynomial equation for a sequence. So I have tried to write a "high school proof" for the equivalence: finding the denominator using finite differences, and minimising the algebraic manipulation of sums in the numerator. Depending on the students the proof is presented to, you may prefer this approach to the numerator, but combine it with a more conventional method for the denominator.
Denominator, $\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}$

With no ties, the data $x_i$ are the ranks $1, 2, \ldots, n$ in some order, so it is easy to show $\bar{x} = \frac{n+1}{2}$. We can reorder the sum as $\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{k=1}^n \left(k - \frac{n+1}{2}\right)^2$, though with lower grade students I'd likely write this sum out explicitly rather than in sigma notation. The sum of a quadratic in $k$ will be cubic in $n$, a fact that students familiar with the finite difference method may grasp intuitively: differencing a cubic produces a quadratic, so summing a quadratic produces a cubic. Determining the coefficients of the cubic is straightforward if students are comfortable manipulating $\Sigma$ notation and know (and remember!) the formulae for $\sum_{k=1}^n k$ and $\sum_{k=1}^n k^2$. But they can also be deduced using finite differences, as follows.
When $n = 1$, the data set is just $\{1\}$, $\bar{x} = 1$, so $\sum (x_i - \bar{x})^2 = 0$.

For $n = 2$, the data are $\{1, 2\}$, $\bar{x} = 1.5$, so $\sum (x_i - \bar{x})^2 = (-0.5)^2 + (0.5)^2 = 0.5$.

For $n = 3$, the data are $\{1, 2, 3\}$, $\bar{x} = 2$, so $\sum (x_i - \bar{x})^2 = (-1)^2 + 0^2 + 1^2 = 2$.
These computations are fairly brief, and help reinforce what the $\Sigma$ notation means, and in short order we produce the finite difference table. Writing $f(n) = \sum_{k=1}^n \left(k - \frac{n+1}{2}\right)^2$ and continuing to $n = 4$ and $n = 5$ (which give $5$ and $10$):

$$\begin{array}{c|ccccc} n & 1 & 2 & 3 & 4 & 5 \\ \hline f(n) & 0 & 0.5 & 2 & 5 & 10 \\ \Delta f & & 0.5 & 1.5 & 3 & 5 \\ \Delta^2 f & & & 1 & 1.5 & 2 \\ \Delta^3 f & & & & 0.5 & 0.5 \end{array}$$
We can obtain the coefficients of $f(n)$ by cranking out the finite difference method as outlined in the links above. For instance, the constant third differences indicate our polynomial is indeed cubic, with leading coefficient $\frac{0.5}{3!} = \frac{1}{12}$. There are a few tricks to minimise drudgery: a well-known one is to use the common differences to extend the sequence back to $n = 0$, as knowing $f(0)$ immediately gives away the constant coefficient. Another is to try extending the sequence to see if $f(n)$ is zero for an integer $n$ - e.g. if the sequence had been positive but decreasing, it would be worth extending rightwards to see if we could "catch a root", as this makes factorisation easier later. In our case, the function seems to hover around low values when $n$ is small, so let's extend even further leftwards: working back through the differences gives $f(0) = 0$ and $f(-1) = 0$.
Aha! It turns out we have caught all three roots: $f(-1) = f(0) = f(1) = 0$. So the polynomial has factors of $(n+1)$, $n$, and $(n-1)$. Since it was cubic it must be of the form:

$$f(n) = a\,n(n-1)(n+1)$$

We can see that $a$ must be the coefficient of $n^3$, which we already determined to be $\frac{1}{12}$. Alternatively, since $f(2) = 0.5$ we have $0.5 = a \times 2 \times 1 \times 3$, which leads to the same conclusion. Expanding the difference of two squares $(n-1)(n+1) = n^2 - 1$ gives:

$$f(n) = \frac{n(n^2-1)}{12}$$
Since the same argument applies to $\sum (y_i - \bar{y})^2$, the denominator is $\sqrt{\frac{n(n^2-1)}{12} \cdot \frac{n(n^2-1)}{12}} = \frac{n(n^2-1)}{12}$ and we are done. Ignoring my exposition, this method is surprisingly short. If one can spot that the polynomial is cubic, it is necessary only to calculate $f(n)$ for the cases $n = 1$ to $5$ to establish that the third difference is $0.5$. Root-hunters need only extend the sequence leftwards to $n = 0$ and $n = -1$, by which point all three roots have been found. It took me a couple of minutes to find $\frac{n(n^2-1)}{12}$ this way.
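For readers who would rather let a machine do the finite-difference bookkeeping, here is a short sketch (mine, assuming NumPy): it rebuilds the table above and confirms both the constant third difference of $0.5$ and the closed form $\frac{n(n^2-1)}{12}$.

```
# Build f(n) = sum_k (k - mean)^2 for n = 1..6, take third differences,
# and compare against the closed form n*(n^2 - 1)/12.
import numpy as np

def f(n):
    k = np.arange(1, n + 1)
    return np.sum((k - k.mean()) ** 2)

ns = np.arange(1, 7)
vals = np.array([f(n) for n in ns])       # 0, 0.5, 2, 5, 10, 17.5
print(np.diff(vals, n=3))                 # constant third differences: [0.5 0.5 0.5]
print(np.allclose(vals, ns * (ns ** 2 - 1) / 12))
```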
Numerator, $\sum (x_i - \bar{x})(y_i - \bar{y})$

I note the identity $(a - b)^2 = a^2 - 2ab + b^2$, which can be rearranged to:

$$ab = \frac{1}{2}\left(a^2 + b^2 - (a - b)^2\right)$$

If we let $a = x_i - \bar{x}$ and $b = y_i - \bar{y}$, we have the useful result that $a - b = x_i - y_i = d_i$, because the means, being identical, cancel out. That was my intuition for writing the identity in the first place; I wanted to switch from working with the product of the moments to the square of their differences. We now have:

$$(x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{2}\left((x_i - \bar{x})^2 + (y_i - \bar{y})^2 - d_i^2\right)$$

Hopefully even students unsure how to manipulate $\Sigma$ notation can see how summing over the data set yields:

$$\sum (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{2}\left(\sum (x_i - \bar{x})^2 + \sum (y_i - \bar{y})^2 - \sum d_i^2\right)$$

We have already established, by reordering the sums, that $\sum (x_i - \bar{x})^2 = \sum (y_i - \bar{y})^2$, leaving us with:

$$\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum (x_i - \bar{x})^2 - \frac{1}{2}\sum d_i^2$$

The formula for Spearman's correlation coefficient is within our grasp!

$$r_S = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{\sum (x_i - \bar{x})^2 - \frac{1}{2}\sum d_i^2}{\sum (x_i - \bar{x})^2} = 1 - \frac{\sum d_i^2}{2\sum (x_i - \bar{x})^2}$$

Substituting the earlier result that $\sum (x_i - \bar{x})^2 = \frac{n(n^2-1)}{12}$ will finish the job:

$$r_S = 1 - \frac{\sum d_i^2}{2 \cdot \frac{n(n^2-1)}{12}} = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$$
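To close the loop, a final sketch (again mine, assuming NumPy and SciPy): rank two arbitrary tie-free samples, apply the formula just derived, and compare with scipy.stats.spearmanr.

```
# Check: 1 - 6*sum(d^2)/(n*(n^2 - 1)) computed on the ranks matches scipy's Spearman rho.
import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(2)
data_x = rng.normal(size=12)
data_y = rng.normal(size=12)            # continuous draws, so ties are unlikely

rx, ry = rankdata(data_x), rankdata(data_y)
n = len(rx)
d2 = np.sum((rx - ry) ** 2)

r_s = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(np.isclose(r_s, spearmanr(data_x, data_y)[0]))
```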