Let's say that you choose some linear combination of these variables, e.g. $A + 2B + 5C$.
This question can be understood in two different ways, leading to two different answers.
A linear combination corresponds to a vector, which in your example is $(1, 2, 5, 0, 0, 0)$. This vector, in turn, defines an axis in the 6D space of the original variables. What you are asking is, how much variance does projection on this axis "describe"? The answer is given via the notion of "reconstruction" of original data from this projection, and measuring the reconstruction error (see Wikipedia on Fraction of variance unexplained). Turns out, this reconstruction can be reasonably done in two different ways, yielding two different answers.
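To make "projection on this axis" concrete, here is a minimal numpy sketch; the dataset itself is made up purely for illustration, and the coefficient vector is normalized to unit length, as assumed throughout below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical centered dataset: n = 100 samples, d = 6 variables (A..F);
# the data are made up purely for illustration.
X = rng.standard_normal((100, 6))
X -= X.mean(axis=0)

# The linear combination A + 2B + 5C corresponds to this coefficient vector;
# normalizing it gives the unit vector defining an axis in the 6-D space.
w = np.array([1.0, 2.0, 5.0, 0.0, 0.0, 0.0])
w /= np.linalg.norm(w)

projection = X @ w        # value of the (normalized) linear combination for each sample
print(projection.shape)   # (100,)
```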
Approach #1
Let $X$ be the centered dataset ($n$ rows corresponding to samples, $d$ columns corresponding to variables), let $\Sigma$ be its covariance matrix, and let $w$ be a unit vector from $\mathbb R^d$. The total variance of the dataset is the sum of all $d$ variances, i.e. the trace of the covariance matrix: $T=\operatorname{tr}(\Sigma)$. The question: what proportion of $T$ does $w$ "describe"? The two answers by @todddeluca and @probabilityislogic both amount to the following: take the projection $Xw$, compute its variance and divide by $T$:
$$R^2_\mathrm{first}=\frac{\operatorname{Var}(Xw)}{T}=\frac{w^\top\Sigma w}{\operatorname{tr}(\Sigma)}.$$
This might not be immediately obvious, because e.g. @probabilityislogic suggests to consider the reconstruction $Xww^\top$ and then to compute
$$\frac{\|X\|^2-\|X-Xww^\top\|^2}{\|X\|^2},$$
but with a little algebra this can be shown to be an equivalent expression.
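Here is a quick numerical check of that equivalence; the dataset and the unit vector $w$ below are arbitrary and generated only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary centered dataset and arbitrary unit vector w, just for the check.
X = rng.standard_normal((200, 4))
X -= X.mean(axis=0)
w = rng.standard_normal(4)
w /= np.linalg.norm(w)

Sigma = np.cov(X, rowvar=False)

# Approach #1 written directly as w' Sigma w / tr(Sigma) ...
r2_direct = (w @ Sigma @ w) / np.trace(Sigma)

# ... and via the reconstruction X w w' suggested by @probabilityislogic.
reconstruction = np.outer(X @ w, w)
r2_reconstruction = (np.sum(X**2) - np.sum((X - reconstruction)**2)) / np.sum(X**2)

print(np.isclose(r2_direct, r2_reconstruction))   # True
```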
Approach #2
Okay. Now consider the following example: $X$ is a $d=2$ dataset with covariance matrix
$$\Sigma=\begin{pmatrix}1 & 0.99\\ 0.99 & 1\end{pmatrix}$$
and $w=(1,0)^\top$ is simply the $x$-axis vector:
The total variance is $T=2$. The variance of the projection onto $w$ (shown as red dots) is equal to $1$. So according to the above logic, the explained variance is equal to $1/2$. And in some sense it is: the red dots ("reconstruction") are far away from the corresponding blue dots, so a lot of the variance is "lost".
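These numbers follow directly from the covariance matrix of the example; a short sketch (working with $\Sigma$ itself rather than with sampled data):

```python
import numpy as np

Sigma = np.array([[1.0, 0.99],
                  [0.99, 1.0]])
w = np.array([1.0, 0.0])     # projection onto the first coordinate axis

T = np.trace(Sigma)          # total variance: 2
var_proj = w @ Sigma @ w     # variance of the projection: 1

print(T, var_proj, var_proj / T)   # 2.0 1.0 0.5
```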
On the other hand, the two variables have $0.99$ correlation and so are almost identical; saying that one of them describes only $50\%$ of the total variance is weird, because each of them contains "almost all the information" about the other one. We can formalize it as follows: given the projection $Xw$, find the best possible reconstruction $Xwv^\top$ with $v$ not necessarily the same as $w$, and then compute the reconstruction error and plug it into the expression for the proportion of explained variance:
$$R^2_\mathrm{second}=\frac{\|X\|^2-\|X-Xwv^\top\|^2}{\|X\|^2},$$
where $v$ is chosen such that $\|X-Xwv^\top\|^2$ is minimal (i.e. $R^2$ is maximal). This is exactly equivalent to computing the $R^2$ of multivariate regression predicting the original dataset $X$ from the $1$-dimensional projection $Xw$.
It is a matter of straightforward algebra to use the regression solution for $v$ to find that the whole expression simplifies to
$$R^2_\mathrm{second}=\frac{\|\Sigma w\|^2}{w^\top\Sigma w\cdot\operatorname{tr}(\Sigma)}.$$
In the example above this is equal to $0.9901$, which seems reasonable.
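A quick check of this value with the example covariance matrix:

```python
import numpy as np

Sigma = np.array([[1.0, 0.99],
                  [0.99, 1.0]])
w = np.array([1.0, 0.0])

r2_second = np.sum((Sigma @ w) ** 2) / ((w @ Sigma @ w) * np.trace(Sigma))
print(r2_second)   # 0.99005, i.e. ~0.9901
```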
Note that if (and only if) $w$ is one of the eigenvectors of $\Sigma$, i.e. one of the principal axes, with eigenvalue $\lambda$ (so that $\Sigma w=\lambda w$), then both approaches to computing $R^2$ coincide and reduce to the familiar PCA expression
$$R^2_\mathrm{PCA}=R^2_\mathrm{first}=R^2_\mathrm{second}=\lambda/\operatorname{tr}(\Sigma)=\lambda\Big/\sum\lambda_i.$$
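A small sketch confirming this coincidence numerically, using the example covariance matrix from above and its leading eigenvector:

```python
import numpy as np

Sigma = np.array([[1.0, 0.99],
                  [0.99, 1.0]])

# Take w to be the leading eigenvector (a principal axis) of Sigma.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
lam, w = eigenvalues[-1], eigenvectors[:, -1]

r2_first = (w @ Sigma @ w) / np.trace(Sigma)
r2_second = np.sum((Sigma @ w) ** 2) / ((w @ Sigma @ w) * np.trace(Sigma))
r2_pca = lam / np.trace(Sigma)

print(r2_first, r2_second, r2_pca)   # all equal: 1.99 / 2 = 0.995
```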
PS. See my answer here for an application of the derived formula to the special case of $w$ being one of the basis vectors: Variance of the data explained by a single variable.
Appendix. Derivation of the formula for $R^2_\mathrm{second}$
Finding $v$ minimizing the reconstruction error $\|X-Xwv^\top\|^2$ is a regression problem (with $Xw$ as univariate predictor and $X$ as multivariate response). Its solution is given by
$$v^\top=\big((Xw)^\top (Xw)\big)^{-1}(Xw)^\top X=(w^\top\Sigma w)^{-1}w^\top\Sigma.$$
Next, the $R^2$ formula can be simplified as
$$R^2=\frac{\|X\|^2-\|X-Xwv^\top\|^2}{\|X\|^2}=\frac{\|Xwv^\top\|^2}{\|X\|^2}$$
due to the Pythagoras theorem, because the hat matrix in regression is an orthogonal projection (but it is also easy to show directly).
Plugging in the equation for $v$, we obtain for the numerator (taking $\Sigma=X^\top X$ for convenience; the constant normalization factor cancels in the final ratio anyway):
$$\|Xwv^\top\|^2=\operatorname{tr}\big(Xwv^\top(Xwv^\top)^\top\big)=\operatorname{tr}(Xww^\top\Sigma\,\Sigma ww^\top X^\top)/(w^\top\Sigma w)^2=\operatorname{tr}(w^\top\Sigma\,\Sigma w)/(w^\top\Sigma w)=\|\Sigma w\|^2/(w^\top\Sigma w).$$
The denominator is equal to $\|X\|^2=\operatorname{tr}(\Sigma)$, resulting in the formula given above.
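As a sanity check of this derivation, one can compare the closed-form expression against an explicit regression on simulated data; the data below are arbitrary, and only the agreement of the two numbers matters.

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary centered dataset and arbitrary unit vector w, just for the check.
X = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 3))
X -= X.mean(axis=0)
w = rng.standard_normal(3)
w /= np.linalg.norm(w)

# Explicit regression of X on the univariate predictor Xw:
Xw = X @ w
v = (Xw @ X) / (Xw @ Xw)              # regression solution v' = ((Xw)'Xw)^{-1} (Xw)'X
r2_regression = 1 - np.sum((X - np.outer(Xw, v)) ** 2) / np.sum(X ** 2)

# Closed-form expression derived above (the scale of Sigma cancels in the ratio):
Sigma = np.cov(X, rowvar=False)
r2_formula = np.sum((Sigma @ w) ** 2) / ((w @ Sigma @ w) * np.trace(Sigma))

print(np.isclose(r2_regression, r2_formula))   # True
```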