Pythonでの主成分分析と回帰

11

私はSASで行ったいくつかの作業をPythonで再現する方法を理解しようとしています。多重共線性が問題となるこのデータセットを使用して、Pythonで主成分分析を実行したいと思います。scikit-learnとstatsmodelsを確認しましたが、それらの出力を取得してSASと同じ結果構造に変換する方法がわかりません。1つには、SASを使用するPROC PRINCOMPと相関行列でPCAを実行するように見えますが、Pythonライブラリのほとんど（すべて？）はSVDを使用しているようです。

データセット、最初の列は、応答変数であり、次の5はpred1-pred5呼ばれる予測変数です。

SASでは、一般的なワークフローは次のとおりです。

/* Get the PCs */
proc princomp data=indata out=pcdata;
    var pred1 pred2 pred3 pred4 pred5;
run;

/* Standardize the response variable */
proc standard data=pcdata mean=0 std=1 out=pcdata2;
    var response;
run;

/* Compare some models */
proc reg data=pcdata2;
    Reg:     model response = pred1 pred2 pred3 pred4 pred5 / vif;
    PCa:     model response = prin1-prin5 / vif;
    PCfinal: model response = prin1 prin2 / vif;
run;
quit;

/* Use Proc PLS to to PCR Replacement - dropping pred5 */
/* This gets me my parameter estimates for the original data */
proc pls data=indata method=pcr nfac=2;
    model response = pred1 pred2 pred3 pred4 / solution;
run;
quit;

PC1とPC2を順番に選択するだけなので、最後の手順が機能することはわかっています。

だから、Pythonでは、これは私が得た限りです：

import pandas as pd
import numpy  as np
from sklearn.decomposition.pca import PCA

source = pd.read_csv('C:/sourcedata.csv')

# Create a pandas DataFrame object
frame = pd.DataFrame(source)

# Make sure we are working with the proper data -- drop the response variable
cols = [col for col in frame.columns if col not in ['response']]
frame2 = frame[cols]

pca = PCA(n_components=5)
pca.fit(frame2)

各PCが説明する分散の量？

print pca.explained_variance_ratio_

Out[190]:
array([  9.99997603e-01,   2.01265023e-06,   2.70712663e-07,
         1.11512302e-07,   2.40310191e-09])

これは何？固有ベクトル？

print pca.components_

Out[179]:
array([[ -4.32840645e-04,  -7.18123771e-04,  -9.99989955e-01,
         -4.40303223e-03,  -2.46115129e-05],
       [  1.00991662e-01,   8.75383248e-02,  -4.46418880e-03,
          9.89353169e-01,   5.74291257e-02],
       [ -1.04223303e-02,   9.96159390e-01,  -3.28435046e-04,
         -8.68305757e-02,  -4.26467920e-03],
       [ -7.04377522e-03,   7.60168675e-04,  -2.30933755e-04,
          5.85966587e-02,  -9.98256573e-01],
       [ -9.94807648e-01,  -1.55477793e-03,  -1.30274879e-05,
          1.00934650e-01,   1.29430210e-02]])

これらは固有値ですか？

print pca.explained_variance_

Out[180]:
array([  8.07640319e+09,   1.62550137e+04,   2.18638986e+03,
         9.00620474e+02,   1.94084664e+01])

Pythonの結果から実際に主成分回帰（Pythonで）を実行する方法については、少し途方に暮れています。SASと同様に、Pythonライブラリのいずれかが空白を埋めていますか？

ヒントがあれば大歓迎です。私はSAS出力でのラベルの使用に少し甘やかされており、パンダ、numpy、scipy、またはscikit-learnにはあまり詳しくありません。

編集：

したがって、sklearnがパンダのデータフレームを直接操作しないようです。それを派手な配列に変換するとしましょう：

npa = frame2.values
npa

ここに私が得るものがあります：

Out[52]:
array([[  8.45300000e+01,   4.20730000e+02,   1.99443000e+05,
          7.94000000e+02,   1.21100000e+02],
       [  2.12500000e+01,   2.73810000e+02,   4.31180000e+04,
          1.69000000e+02,   6.28500000e+01],
       [  3.38200000e+01,   3.73870000e+02,   7.07290000e+04,
          2.79000000e+02,   3.53600000e+01],
       ..., 
       [  4.71400000e+01,   3.55890000e+02,   1.02597000e+05,
          4.07000000e+02,   3.25200000e+01],
       [  1.40100000e+01,   3.04970000e+02,   2.56270000e+04,
          9.90000000e+01,   7.32200000e+01],
       [  3.85300000e+01,   3.73230000e+02,   8.02200000e+04,
          3.17000000e+02,   4.32300000e+01]])

次にcopy、sklearnのPCAのパラメーターを変更すると、False,以下のコメントに従って、アレイで直接動作します。

pca = PCA(n_components=5,copy=False)
pca.fit(npa)

npa

出力ごとnpaに、配列に何かを追加するのではなく、すべての値を置き換えたように見えます。今の価値は何npaですか？元の配列の主成分スコアは？

Out[64]:
array([[  3.91846649e+01,   5.32456568e+01,   1.03614689e+05,
          4.06726542e+02,   6.59830027e+01],
       [ -2.40953351e+01,  -9.36743432e+01,  -5.27103110e+04,
         -2.18273458e+02,   7.73300268e+00],
       [ -1.15253351e+01,   6.38565684e+00,  -2.50993110e+04,
         -1.08273458e+02,  -1.97569973e+01],
       ..., 
       [  1.79466488e+00,  -1.15943432e+01,   6.76868901e+03,
          1.97265416e+01,  -2.25969973e+01],
       [ -3.13353351e+01,  -6.25143432e+01,  -7.02013110e+04,
         -2.88273458e+02,   1.81030027e+01],
       [ -6.81533512e+00,   5.74565684e+00,  -1.56083110e+04,
         -7.02734584e+01,  -1.18869973e+01]])

pca python scikit-learn

— 粘土
ソース

1

scikit-learnでは、各サンプルはデータマトリックスの行として保存されます。PCAクラスは、データ行列を直接操作します。つまり、共分散行列を計算してから、その固有ベクトルを計算します。最後の3つの質問については、はい、components_は共分散行列の固有ベクトル、explained_variance_ratio_は各PCが説明する分散、説明された分散は固有値に対応している必要があります。

— 光錬金術師2014年

@lightalchemist説明をありがとうございます。sklearnを使用して、PCAを実行する前に新しいデータフレームを作成することは適切ですか、または「完全な」パンダデータフレームを送信して、左端の（応答）列で動作しないようにすることは可能ですか？

— クレイ

もう少し情報を追加しました。最初にnumpy配列に変換してからでPCAを実行するとcopy=False、新しい値が取得されます。それらは主成分スコアですか？

— クレイ、

私はパンダにあまり慣れていないので、あなたの質問のその部分への返事はありません。第二部に関しては、それらが主成分ではないと思います。私はそれらが元のデータサンプルであると信じていますが、平均が差し引かれています。しかし、私はそれについて本当に確信が持てません。

— 光錬金術師2014年

15

Scikit-learnには、R のplsパッケージなどのPCAと回帰の組み合わせ実装はありません。ただし、以下のようにしたり、PLS回帰を選択したりできると思います。

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn import cross_validation
from sklearn.linear_model import LinearRegression

%matplotlib inline

import seaborn as sns
sns.set_style('darkgrid')

df = pd.read_csv('multicollinearity.csv')
X = df.iloc[:,1:6]
y = df.response

Scikit-learn PCA

pca = PCA()

データをスケーリングおよび変換して主成分を取得する

X_reduced = pca.fit_transform(scale(X))

主成分によって説明される分散（％累積）

np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)

array([  73.39,   93.1 ,   98.63,   99.89,  100.  ])

最初の2つのコンポーネントが実際にデータの分散のほとんどを説明しているようです。

10倍CV、シャッフル付き

n = len(X_reduced)
kf_10 = cross_validation.KFold(n, n_folds=10, shuffle=True, random_state=2)

regr = LinearRegression()
mse = []

1つのCVを実行して、切片のみのMSEを取得します（回帰には主成分はありません）。

score = -1*cross_validation.cross_val_score(regr, np.ones((n,1)), y.ravel(), cv=kf_10, scoring='mean_squared_error').mean()    
mse.append(score)

5つの主要コンポーネントに対してCVを実行し、一度に1つのコンポーネントを回帰に追加します

for i in np.arange(1,6):
    score = -1*cross_validation.cross_val_score(regr, X_reduced[:,:i], y.ravel(), cv=kf_10, scoring='mean_squared_error').mean()
    mse.append(score)

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,5))
ax1.plot(mse, '-v')
ax2.plot([1,2,3,4,5], mse[1:6], '-v')
ax2.set_title('Intercept excluded from plot')

for ax in fig.axes:
    ax.set_xlabel('Number of principal components in regression')
    ax.set_ylabel('MSE')
    ax.set_xlim((-0.2,5.2))

Scikit-learn PLS回帰

mse = []

kf_10 = cross_validation.KFold(n, n_folds=10, shuffle=True, random_state=2)

for i in np.arange(1, 6):
    pls = PLSRegression(n_components=i, scale=False)
    pls.fit(scale(X_reduced),y)
    score = cross_validation.cross_val_score(pls, X_reduced, y, cv=kf_10, scoring='mean_squared_error').mean()
    mse.append(-score)

plt.plot(np.arange(1, 6), np.array(mse), '-v')
plt.xlabel('Number of principal components in PLS regression')
plt.ylabel('MSE')
plt.xlim((-0.2, 5.2))

— ジョルディ
ソース

7

これはPythonとNumPyのみのSVD（数年後）です。
（これはSSA / sklearn / pandasに関する質問にはまったく対応していませんが、いつかはpythonistを助けるかもしれません。）

#!/usr/bin/env python2
""" SVD straight up """
# geometry: see http://www.ams.org/samplings/feature-column/fcarc-svd

from __future__ import division
import sys
import numpy as np

__version__ = "2015-06-15 jun  denis-bz-py t-online de"

# from bz.etc import numpyutil as nu
def ints( x ):
    return np.round(x).astype(int)  # NaN Inf -> - maxint

def quantiles( x ):
    return "quantiles %s" % ints( np.percentile( x, [0, 25, 50, 75, 100] ))


#...........................................................................
csvin = "ccheaton-multicollinearity.csv"  # https://gist.github.com/ccheaton/8393329
plot = 0

    # to change these vars in sh or ipython, run this.py  csvin=\"...\"  plot=1  ...
for arg in sys.argv[1:]:
    exec( arg )

np.set_printoptions( threshold=10, edgeitems=10, linewidth=120,
    formatter = dict( float = lambda x: "%.2g" % x ))  # float arrays %.2g

#...........................................................................
yX = np.loadtxt( csvin, delimiter="," )
y = yX[:,0]
X = yX[:,1:]
print "read %s" % csvin
print "y %d  %s" % (len(y), quantiles(y))
print "X %s  %s" % (X.shape, quantiles(X))
print ""

#...........................................................................
U, sing, Vt = np.linalg.svd( X, full_matrices=False )
#...........................................................................

print "SVD: %s -> U %s . sing diagonal . Vt %s" % (
        X.shape, U.shape, Vt.shape )
print "singular values:", ints( sing )
    # % variance (sigma^2) explained != % sigma explained, e.g. 10 1 1 1 1

var = sing**2
var *= 100 / var.sum()
print "% variance ~ sing^2:", var

print "Vt, the right singular vectors  * 100:\n", ints( Vt * 100 )
    # multicollinear: near +- 100 in each row / col

yU = y.dot( U )
yU *= 100 / yU.sum()
print "y ~ these percentages of U, the left singular vectors:", yU

->ログ

# from: test-pca.py
# run: 15 Jun 2015 16:45  in ~bz/py/etc/data/etc  Denis-iMac 10.8.3
# versions: numpy 1.9.2  scipy 0.15.1   python 2.7.6   mac 10.8.3

read ccheaton-multicollinearity.csv
y 373  quantiles [  2823  60336  96392 147324 928560]
X (373, 5)  quantiles [     7     47    247    573 512055]

SVD: (373, 5) -> U (373, 5) . sing diagonal . Vt (5, 5)
singular values: [2537297    4132    2462     592      87]
% variance ~ sing^2: [1e+02 0.00027 9.4e-05 5.4e-06 1.2e-07]
Vt, the right singular vectors  * 100:
[[  0   0 100   0   0]
 [  1  98   0 -12  17]
 [-10 -11   0 -99  -6]
 [  1 -17   0  -4  98]
 [-99   2   0  10   2]]
y ~ these percentages of U, the left singular vectors: [1e+02 15 -18 0.88 -0.57]

— デニス
ソース

私はパーティーに少し遅れましたが、すばらしい答えです

— plumbus_bouquet

3

パイプラインを使用して主成分分析と線形回帰を組み合わせてみます。

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Principle components regression
steps = [
    ('scale', StandardScaler()),
    ('pca', PCA()),
    ('estimator', LinearRegression())
]
pipe = Pipeline(steps)
pca = pipe.set_params(pca__n_components=3)
pca.fit(X, y)

— ジョー
ソース

3

私の答えはほぼ5年遅れて来ており、PythonでPCRを実行することに関して、もはや助けを必要としない可能性が十分あります。私たちはhoggormという名前のPythonパッケージを開発しました。これは、当時必要だったものを正確に実行します。こちらのPCRの例をご覧ください。hoggormで計算された結果を視覚化するためのhoggormplotという名前の補足的なプロットパッケージもあります。

— オリ
ソース