Principal component analysis and regression in Python

11

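I'm trying to figure out how to reproduce in Python some work that I've done in SAS. Using this dataset, where multicollinearity is a problem, I would like to perform principal component analysis in Python. I've looked at scikit-learn and statsmodels, but I'm not sure how to take their output and convert it to the same structure of results as SAS. For one thing, SAS appears to perform PCA on the correlation matrix when you use PROC PRINCOMP, whereas most of the Python libraries appear to use SVD.

In the dataset, the first column is the response variable and the next five are the predictor variables, named pred1 through pred5.

In SAS, the general workflow is: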

/* Get the PCs */
proc princomp data=indata out=pcdata;
    var pred1 pred2 pred3 pred4 pred5;
run;

/* Standardize the response variable */
proc standard data=pcdata mean=0 std=1 out=pcdata2;
    var response;
run;

/* Compare some models */
proc reg data=pcdata2;
    Reg:     model response = pred1 pred2 pred3 pred4 pred5 / vif;
    PCa:     model response = prin1-prin5 / vif;
    PCfinal: model response = prin1 prin2 / vif;
run;
quit;

/* Use Proc PLS to do PCR Replacement - dropping pred5 */
/* This gets me my parameter estimates for the original data */
proc pls data=indata method=pcr nfac=2;
    model response = pred1 pred2 pred3 pred4 / solution;
run;
quit;

I know that the last step only works because I'm only choosing PC1 and PC2, in order.

So, in Python, this is about as far as I've gotten:

import pandas as pd
import numpy  as np
from sklearn.decomposition import PCA

source = pd.read_csv('C:/sourcedata.csv')

# Create a pandas DataFrame object
frame = pd.DataFrame(source)

# Make sure we are working with the proper data -- drop the response variable
cols = [col for col in frame.columns if col not in ['response']]
frame2 = frame[cols]

pca = PCA(n_components=5)
pca.fit(frame2)

How much variance does each PC explain?

print(pca.explained_variance_ratio_)

Out[190]:
array([  9.99997603e-01,   2.01265023e-06,   2.70712663e-07,
         1.11512302e-07,   2.40310191e-09])

What are these? Eigenvectors?

print(pca.components_)

Out[179]:
array([[ -4.32840645e-04,  -7.18123771e-04,  -9.99989955e-01,
         -4.40303223e-03,  -2.46115129e-05],
       [  1.00991662e-01,   8.75383248e-02,  -4.46418880e-03,
          9.89353169e-01,   5.74291257e-02],
       [ -1.04223303e-02,   9.96159390e-01,  -3.28435046e-04,
         -8.68305757e-02,  -4.26467920e-03],
       [ -7.04377522e-03,   7.60168675e-04,  -2.30933755e-04,
          5.85966587e-02,  -9.98256573e-01],
       [ -9.94807648e-01,  -1.55477793e-03,  -1.30274879e-05,
          1.00934650e-01,   1.29430210e-02]])

Are these the eigenvalues?

print(pca.explained_variance_)

Out[180]:
array([  8.07640319e+09,   1.62550137e+04,   2.18638986e+03,
         9.00620474e+02,   1.94084664e+01])

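I'm at a bit of a loss as to how to get from the Python results to actually performing principal component regression in Python. Do any of the Python libraries fill in the blanks the way SAS does?

Any tips are appreciated. I'm a bit spoiled by the labels in SAS output, and I'm not very familiar with pandas, numpy, scipy, or scikit-learn.

Edit:

So, it looks like sklearn won't operate directly on a pandas dataframe. Let's say that I convert it to a numpy array: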

npa = frame2.values
npa

Here is what I get:

Out[52]:
array([[  8.45300000e+01,   4.20730000e+02,   1.99443000e+05,
          7.94000000e+02,   1.21100000e+02],
       [  2.12500000e+01,   2.73810000e+02,   4.31180000e+04,
          1.69000000e+02,   6.28500000e+01],
       [  3.38200000e+01,   3.73870000e+02,   7.07290000e+04,
          2.79000000e+02,   3.53600000e+01],
       ..., 
       [  4.71400000e+01,   3.55890000e+02,   1.02597000e+05,
          4.07000000e+02,   3.25200000e+01],
       [  1.40100000e+01,   3.04970000e+02,   2.56270000e+04,
          9.90000000e+01,   7.32200000e+01],
       [  3.85300000e+01,   3.73230000e+02,   8.02200000e+04,
          3.17000000e+02,   4.32300000e+01]])

If I then change the copy parameter of sklearn's PCA to False, it operates directly on the array, per the comment below.

pca = PCA(n_components=5,copy=False)
pca.fit(npa)

npa

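Per the output, it looks like it replaced all of the values in npa instead of appending anything to the array. What are the values in npa now? The principal component scores for the original array?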

Out[64]:
array([[  3.91846649e+01,   5.32456568e+01,   1.03614689e+05,
          4.06726542e+02,   6.59830027e+01],
       [ -2.40953351e+01,  -9.36743432e+01,  -5.27103110e+04,
         -2.18273458e+02,   7.73300268e+00],
       [ -1.15253351e+01,   6.38565684e+00,  -2.50993110e+04,
         -1.08273458e+02,  -1.97569973e+01],
       ..., 
       [  1.79466488e+00,  -1.15943432e+01,   6.76868901e+03,
          1.97265416e+01,  -2.25969973e+01],
       [ -3.13353351e+01,  -6.25143432e+01,  -7.02013110e+04,
         -2.88273458e+02,   1.81030027e+01],
       [ -6.81533512e+00,   5.74565684e+00,  -1.56083110e+04,
         -7.02734584e+01,  -1.18869973e+01]])

pca python scikit-learn

— Clay

1

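In scikit-learn, each sample is stored as a row in the data matrix. The PCA class operates on the data matrix directly, i.e. it takes care of computing the covariance matrix and then its eigenvectors. Regarding your final three questions: yes, components_ are the eigenvectors of the covariance matrix, explained_variance_ratio_ is the fraction of variance each PC explains, and explained_variance_ should correspond to the eigenvalues.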
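One way to check the claim in this comment is to compare the PCA attributes against an explicit eigen-decomposition of the sample covariance matrix. The sketch below uses synthetic data (an assumption, not the dataset from the question), and the variable names are only illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the predictors: 200 samples, 5 correlated columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

pca = PCA(n_components=5).fit(X)

# Eigen-decomposition of the sample covariance matrix (rows are samples).
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]  # eigh returns eigenvalues in ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# explained_variance_ should match the eigenvalues.
same_eigenvalues = np.allclose(pca.explained_variance_, eigvals)

# Eigenvectors are only defined up to sign, so compare absolute values.
# Each row of components_ corresponds to one column of eigvecs.
same_eigenvectors = np.allclose(
    np.abs(pca.components_), np.abs(eigvecs.T), atol=1e-6
)

print(same_eigenvalues, same_eigenvectors)
```

Note that components_ stores one eigenvector per row, while np.linalg.eigh returns them as columns, hence the transpose.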

— lightalchemist

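@lightalchemist Thanks for the clarification. With sklearn, is it proper to create a new dataframe before performing the PCA, or is it possible to send in the 'complete' pandas dataframe and have it not operate on the leftmost (response) column?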

— Clay

I added a bit more info. If I convert to a numpy array first and then run PCA with copy=False, I get new values. Are those the principal component scores?

— Clay

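I'm not that familiar with pandas, so I don't have an answer to that part of your question. As for the second part, I don't think they are the principal components. I believe they are the original data samples with the mean subtracted. However, I can't really be sure about it.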
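The distinction drawn here, between mean-centered data and principal component scores, can be illustrated without relying on copy=False at all. This is a minimal sketch on synthetic data (an assumption); X stands in for npa.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for npa: 100 samples, 5 columns with non-zero means.
rng = np.random.default_rng(1)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 5))

# Mean-centered data: each column has its own mean subtracted.
centered = X - X.mean(axis=0)

pca = PCA(n_components=5).fit(X)
scores = pca.transform(X)  # the principal component scores

# The scores are the centered data projected onto the eigenvectors.
scores_by_hand = centered @ pca.components_.T

scores_match_projection = np.allclose(scores, scores_by_hand)
scores_equal_centered = np.allclose(scores, centered)

print(scores_match_projection, scores_equal_centered)
```

So to get the scores, call transform (or fit_transform) rather than inspecting the input array after fitting.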

— lightalchemist

16

Scikit-learn does not have a combined implementation of PCA and regression like, for example, the pls package in R. But I think you can do it as shown below, or choose PLS regression instead.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# IPython/Jupyter magic; drop this line when running as a plain script
%matplotlib inline

import seaborn as sns
sns.set_style('darkgrid')

df = pd.read_csv('multicollinearity.csv')
X = df.iloc[:,1:6]
y = df.response

Scikit-learn PCA

pca = PCA()

Scale and transform the data to get the principal components

X_reduced = pca.fit_transform(scale(X))

Variance (% cumulative) explained by the principal components

np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)

array([  73.39,   93.1 ,   98.63,   99.89,  100.  ])

It seems the first two components do indeed explain most of the variance in the data.

10-fold CV, with shuffle

n = len(X_reduced)
kf_10 = KFold(n_splits=10, shuffle=True, random_state=2)

regr = LinearRegression()
mse = []

Do one CV to get the MSE for just the intercept (no principal components in the regression)

score = -1*cross_val_score(regr, np.ones((n,1)), y.to_numpy(), cv=kf_10, scoring='neg_mean_squared_error').mean()
mse.append(score)

Do CV for the 5 principal components, adding one component to the regression at a time

for i in np.arange(1,6):
    score = -1*cross_val_score(regr, X_reduced[:,:i], y.to_numpy(), cv=kf_10, scoring='neg_mean_squared_error').mean()
    mse.append(score)

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,5))
ax1.plot(mse, '-v')
ax2.plot([1,2,3,4,5], mse[1:6], '-v')
ax2.set_title('Intercept excluded from plot')

for ax in fig.axes:
    ax.set_xlabel('Number of principal components in regression')
    ax.set_ylabel('MSE')
    ax.set_xlim((-0.2,5.2))

Scikit-learn PLS regression

mse = []

kf_10 = KFold(n_splits=10, shuffle=True, random_state=2)

for i in np.arange(1, 6):
    pls = PLSRegression(n_components=i, scale=False)
    # cross_val_score refits a clone of pls on each fold, so no separate fit is needed
    score = cross_val_score(pls, X_reduced, y, cv=kf_10, scoring='neg_mean_squared_error').mean()
    mse.append(-score)

plt.plot(np.arange(1, 6), np.array(mse), '-v')
plt.xlabel('Number of principal components in PLS regression')
plt.ylabel('MSE')
plt.xlim((-0.2, 5.2))
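On the question's point about PROC PRINCOMP working from the correlation matrix: running PCA on standardized data, as in the scale(X) step above, should be equivalent to an eigen-decomposition of the correlation matrix. The following sketch checks this on synthetic data (an assumption; the column scales are made up to mimic predictors of very different magnitudes).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Synthetic predictors with deliberately different column scales.
rng = np.random.default_rng(2)
mixed = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
X = mixed * np.array([1.0, 10.0, 1000.0, 5.0, 0.1])

pca = PCA().fit(scale(X))

# Eigenvalues of the correlation matrix, largest first.
corr_eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
n = X.shape[0]

# The explained-variance ratios agree exactly with the normalized eigenvalues.
ratios_match = np.allclose(
    pca.explained_variance_ratio_, corr_eigvals / corr_eigvals.sum()
)

# scale() divides by the population standard deviation (ddof=0), while PCA
# reports variances with n - 1 in the denominator, hence the n / (n - 1) factor.
eigenvalues_match = np.allclose(
    pca.explained_variance_, corr_eigvals * n / (n - 1)
)

print(ratios_match, eigenvalues_match)
```

The n / (n - 1) factor is the kind of small difference to expect when comparing numbers against SAS output, since the two tools may standardize with different denominators.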

— Jordi

7

Here's SVD in Python and NumPy (years later).
(This doesn't address your questions about SAS / sklearn / pandas at all, but may help a pythonist someday.)

#!/usr/bin/env python2
""" SVD straight up """
# geometry: see http://www.ams.org/samplings/feature-column/fcarc-svd

from __future__ import division
import sys
import numpy as np

__version__ = "2015-06-15 jun  denis-bz-py t-online de"

# from bz.etc import numpyutil as nu
def ints( x ):
    return np.round(x).astype(int)  # NaN Inf -> - maxint

def quantiles( x ):
    return "quantiles %s" % ints( np.percentile( x, [0, 25, 50, 75, 100] ))


#...........................................................................
csvin = "ccheaton-multicollinearity.csv"  # https://gist.github.com/ccheaton/8393329
plot = 0

    # to change these vars in sh or ipython, run this.py  csvin=\"...\"  plot=1  ...
for arg in sys.argv[1:]:
    exec( arg )

np.set_printoptions( threshold=10, edgeitems=10, linewidth=120,
    formatter = dict( float = lambda x: "%.2g" % x ))  # float arrays %.2g

#...........................................................................
yX = np.loadtxt( csvin, delimiter="," )
y = yX[:,0]
X = yX[:,1:]
print "read %s" % csvin
print "y %d  %s" % (len(y), quantiles(y))
print "X %s  %s" % (X.shape, quantiles(X))
print ""

#...........................................................................
U, sing, Vt = np.linalg.svd( X, full_matrices=False )
#...........................................................................

print "SVD: %s -> U %s . sing diagonal . Vt %s" % (
        X.shape, U.shape, Vt.shape )
print "singular values:", ints( sing )
    # % variance (sigma^2) explained != % sigma explained, e.g. 10 1 1 1 1

var = sing**2
var *= 100 / var.sum()
print "% variance ~ sing^2:", var

print "Vt, the right singular vectors  * 100:\n", ints( Vt * 100 )
    # multicollinear: near +- 100 in each row / col

yU = y.dot( U )
yU *= 100 / yU.sum()
print "y ~ these percentages of U, the left singular vectors:", yU

-> log

# from: test-pca.py
# run: 15 Jun 2015 16:45  in ~bz/py/etc/data/etc  Denis-iMac 10.8.3
# versions: numpy 1.9.2  scipy 0.15.1   python 2.7.6   mac 10.8.3

read ccheaton-multicollinearity.csv
y 373  quantiles [  2823  60336  96392 147324 928560]
X (373, 5)  quantiles [     7     47    247    573 512055]

SVD: (373, 5) -> U (373, 5) . sing diagonal . Vt (5, 5)
singular values: [2537297    4132    2462     592      87]
% variance ~ sing^2: [1e+02 0.00027 9.4e-05 5.4e-06 1.2e-07]
Vt, the right singular vectors  * 100:
[[  0   0 100   0   0]
 [  1  98   0 -12  17]
 [-10 -11   0 -99  -6]
 [  1 -17   0  -4  98]
 [-99   2   0  10   2]]
y ~ these percentages of U, the left singular vectors: [1e+02 15 -18 0.88 -0.57]
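The script above applies the SVD to the raw X, without centering or scaling, which is why the first singular vector lines up almost entirely with the largest-magnitude column. For comparison with PCA, here is a sketch (on synthetic data, which is an assumption) of how the SVD of the centered matrix relates to what sklearn's PCA reports.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with non-zero column means.
rng = np.random.default_rng(4)
X = rng.normal(size=(120, 5)) @ rng.normal(size=(5, 5)) + 10.0

# Center the columns, then take the thin SVD.
Xc = X - X.mean(axis=0)
U, sing, Vt = np.linalg.svd(Xc, full_matrices=False)

pca = PCA().fit(X)
n = X.shape[0]

# Squared singular values of the centered data, divided by n - 1,
# are the PCA eigenvalues.
variances_match = np.allclose(sing**2 / (n - 1), pca.explained_variance_)

# Right singular vectors are the principal axes, up to sign.
axes_match = np.allclose(np.abs(Vt), np.abs(pca.components_), atol=1e-6)

print(variances_match, axes_match)
```

Standardizing the columns before the SVD, in addition to centering, would give the correlation-matrix version of the analysis.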

— denis

I'm a bit late to the party, but great answer.

— plumbus_bouquet

3

Use a pipeline to combine principal component analysis and linear regression:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Principal components regression
steps = [
    ('scale', StandardScaler()),
    ('pca', PCA()),
    ('estimator', LinearRegression())
]
pipe = Pipeline(steps)
pca = pipe.set_params(pca__n_components=3)
pca.fit(X, y)
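The question also asks for parameter estimates on the original data (what the / solution option of PROC PLS prints). With a pipeline like the one above, these can be recovered by undoing the PCA rotation and the scaling. This is only a sketch on synthetic data (an assumption), and because StandardScaler and SAS may standardize with different denominators, it is not guaranteed to reproduce SAS output digit for digit.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic predictors and response, for illustration only.
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4)) + 5.0
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=150)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("estimator", LinearRegression()),
]).fit(X, y)

scaler = pipe.named_steps["scale"]
pca = pipe.named_steps["pca"]
reg = pipe.named_steps["estimator"]

# Coefficients on the standardized predictors: rotate the PC coefficients back.
beta_std = pca.components_.T @ reg.coef_
# Coefficients on the original predictors: undo the scaling.
beta_orig = beta_std / scaler.scale_
# Intercept: account for the centering done by PCA and by the scaler.
intercept = reg.intercept_ - pca.mean_ @ beta_std - scaler.mean_ @ beta_orig

# The back-transformed linear model reproduces the pipeline's predictions.
predictions_match = np.allclose(pipe.predict(X), intercept + X @ beta_orig)
print(predictions_match)
```

Here beta_orig has one entry per original predictor even though only two components enter the regression.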

— Joe

3

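My answer is coming almost five years late, and there is a good chance that you no longer need help with doing PCR in Python. We have developed a Python package named hoggorm that does exactly what you needed back then. Please have a look at the PCR examples here. There is also a complementary plotting package named hoggormplot for visualising results computed with hoggorm.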

— oli