Nested cross-validation and selecting the best regression model: is this the right SKLearn process?

8

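If I understand correctly, nested CV can help me evaluate which model and hyperparameter tuning process is best. The inner loop (GridSearchCV) finds the best hyperparameters, and the outer loop (cross_val_score) evaluates the hyperparameter tuning algorithm. I then pick the tuning/model combination from the outer loop that minimizes mse (I'm looking at regression models) for my final model test.

I've read the questions and answers on nested cross-validation, but I haven't seen an example of a full pipeline that uses it. So, do my code below (please ignore the actual hyperparameter ranges; this is just an example) and my thought process make sense?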

import numpy as np
from sklearn.cross_validation import cross_val_score, train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.datasets import make_regression

# create some regression data
X, y = make_regression(n_samples=1000, n_features=10)
params = [{'C':[0.01,0.05,0.1,1]},{'n_estimators':[10,100,1000]}]

# setup models, variables
mean_score = []
models = [SVR(), RandomForestRegressor()]

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3)

# estimate performance of hyperparameter tuning and model algorithm pipeline
for idx, model in enumerate(models):
    clf = GridSearchCV(model, params[idx], scoring='mean_squared_error')

    # this performs a nested CV in SKLearn
    score = cross_val_score(clf, X_train, y_train, scoring='mean_squared_error')

    # get the mean MSE across each fold
    mean_score.append(np.mean(score))
    print('Model:', model, 'MSE:', mean_score[-1])

# estimate generalization performance of the best model selection technique
best_idx = mean_score.index(max(mean_score)) # because SKLearn flips MSE signs, max works OK here
best_model = models[best_idx]

clf_final = GridSearchCV(best_model, params[best_idx])
clf_final.fit(X_train, y_train)

y_pred = clf_final.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print('Final Model:', best_model, 'Final model RMSE:', rmse)

— BobbyJohnsonOG

8

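Yours is not an example of nested cross-validation.

Nested cross-validation is useful for deciding whether, say, a random forest or an SVM is better suited to your problem. Nested CV only outputs a score; it does not output a model the way your code does.

This would be an example of nested cross-validation: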

from sklearn.datasets import load_boston
from sklearn.cross_validation import KFold
from sklearn.metrics import mean_squared_error
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import numpy as np

params = [{'C': [0.01, 0.05, 0.1, 1]}, {'n_estimators': [10, 100, 1000]}]
models = [SVR(), RandomForestRegressor()]

df = load_boston()
X = df['data']
y = df['target']

cv = [[] for _ in range(len(models))]
for tr, ts in KFold(len(X)):
    for i, (model, param) in enumerate(zip(models, params)):
        best_m = GridSearchCV(model, param)
        best_m.fit(X[tr], y[tr])
        s = mean_squared_error(y[ts], best_m.predict(X[ts]))
        cv[i].append(s)
print(np.mean(cv, 1))
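
The loop above targets an older scikit-learn: `sklearn.cross_validation` and `sklearn.grid_search` were removed in version 0.20, and `load_boston` was removed in 1.2. As a rough sketch of the same manual nested loop on a recent release, assuming synthetic data from `make_regression` as a stand-in for the Boston data and smaller grids to keep it quick (both are illustrative substitutions, not part of the code above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: the code above used load_boston)
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# Illustrative candidates and grids (assumed values)
models = [SVR(), RandomForestRegressor(n_estimators=20, random_state=0)]
params = [{'C': [0.1, 1, 10]}, {'max_depth': [2, 5, None]}]

outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)
cv = [[] for _ in models]
for tr, ts in outer_cv.split(X):
    for i, (model, param) in enumerate(zip(models, params)):
        # inner loop: GridSearchCV tunes on the outer training fold only
        best_m = GridSearchCV(model, param, cv=3)
        best_m.fit(X[tr], y[tr])
        # outer loop: score the tuned model on the held-out fold
        cv[i].append(mean_squared_error(y[ts], best_m.predict(X[ts])))

mean_mse = np.mean(cv, axis=1)
print(mean_mse)  # one average MSE per candidate; no fitted model is returned
```

The result is one averaged error per candidate, which is what you compare to pick between SVR and the random forest.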

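By the way, a couple of thoughts:

I see no point in grid searching n_estimators for your random forest. Clearly, the more, the merrier. Something like max_depth is the kind of regularization you want to optimize. The nested CV error for RandomForest was much higher because you did not optimize the right hyperparameters.
You could also try gradient boosting trees.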
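
As a hedged illustration of both points, one could fix `n_estimators` and search over `max_depth` instead, and add a gradient boosting candidate. The grid values and the synthetic data below are assumptions chosen only to keep the sketch small, and this is a plain grid search rather than a nested one:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# n_estimators is fixed; the search covers regularization-style parameters
candidates = {
    'rf': (RandomForestRegressor(n_estimators=50, random_state=0),
           {'max_depth': [2, 4, 8, None]}),
    'gbt': (GradientBoostingRegressor(n_estimators=50, random_state=0),
            {'max_depth': [1, 2, 3], 'learning_rate': [0.05, 0.1]}),
}

best_params = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=3, scoring='neg_mean_squared_error')
    search.fit(X, y)
    best_params[name] = search.best_params_
    # best_score_ is a negated MSE, so flip the sign for display
    print(name, search.best_params_, -search.best_score_)
```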

— Ricardo Cruz

Thanks for that. My goal is to do exactly what you said: figure out which algorithm is best suited to my problem. I think I'm confused by the SKLearn documentation: scikit-learn.org/stable/tutorial/statistical_inference/… (under "Nested cross-validation")

— BobbyJohnsonOG

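To test the performance of the best selected model, would I run a final cross-validation on the whole dataset? Or should I split my dataset into train/test before the nested CV, run nested CV on train, then fit the best model on the train data and evaluate it on test?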

— BobbyJohnsonOG

Sorry for the comment barrage. So my final model would be:

best_idx = int(np.argmin(np.mean(cv, 1)))  # index of the model with the lowest mean MSE
final_m = GridSearchCV(models[best_idx], params[best_idx])
final_m.fit(X, y)

— BobbyJohnsonOG

Going off what you said, this is what I was trying to do with the built-in SKLearn functions (it gives the same result as your answer):

for model, param in zip(models, params):
    clf = GridSearchCV(model, param)
    my_score = cross_val_score(clf, X, y, scoring='mean_squared_error')
    my_scores.append(my_score)

— BobbyJohnsonOG

7

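Nested cross-validation estimates the generalization error of a model, so it is a good way to choose the best model from a list of candidate models and their associated parameter grids. The original post is close to doing nested CV: rather than doing a single train-test split, one should instead use a second cross-validation splitter. That is, one "nests" an "inner" cross-validation splitter inside an "outer" cross-validation splitter.

The inner cross-validation splitter is used to choose hyperparameters. The outer cross-validation splitter averages the test error over multiple train-test splits. Averaging the generalization error over multiple train-test splits gives a more reliable estimate of the model's accuracy on unseen data.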
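
One way to see the two levels at work is to look at which hyperparameters the inner search picks in each outer fold. The sketch below assumes `cross_validate` with `return_estimator=True`, as found in recent scikit-learn releases, and a small synthetic dataset; the grid and the random seeds are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.svm import SVR

# Synthetic data used only for illustration
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # picks hyperparameters
outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # estimates test error

search = GridSearchCV(SVR(), {'C': [0.1, 1, 10, 100]},
                      cv=inner_cv, scoring='neg_mean_squared_error')

# each outer training fold gets its own, independently tuned GridSearchCV
result = cross_validate(search, X, y, cv=outer_cv,
                        scoring='neg_mean_squared_error',
                        return_estimator=True)

for fold, (est, score) in enumerate(zip(result['estimator'], result['test_score'])):
    print(fold, est.best_params_, -score)

print('mean outer-fold MSE:', -result['test_score'].mean())
```

The chosen `C` may differ from fold to fold; the averaged outer score describes the whole tuning procedure rather than one fixed hyperparameter setting.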

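I modified the code from the original post to update it to the latest version of sklearn (sklearn.cross_validation is superseded by sklearn.model_selection, and 'mean_squared_error' is replaced by 'neg_mean_squared_error'), and I used two KFold cross-validation splitters to select the best model. To learn more about nested cross-validation, see sklearn's example on nested cross-validation.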

from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import numpy as np

# `outer_cv` creates 3 folds for estimating generalization error
outer_cv = KFold(3)

# when we train on a certain fold, we use a second cross-validation
# split in order to choose hyperparameters
inner_cv = KFold(3)

# create some regression data
X, y = make_regression(n_samples=1000, n_features=10)

# give shorthand names to models and use those as dictionary keys mapping
# to models and parameter grids for that model
models_and_parameters = {
    'svr': (SVR(),
            {'C': [0.01, 0.05, 0.1, 1]}),
    'rf': (RandomForestRegressor(),
           {'max_depth': [5, 10, 50, 100, 200, 500]})}

# we will collect the average of the scores on the 3 outer folds in this dictionary
# with keys given by the names of the models in `models_and_parameters`
average_scores_across_outer_folds_for_each_model = dict()

# find the model with the best generalization error
for name, (model, params) in models_and_parameters.items():
    # this object is a regressor that also happens to choose
    # its hyperparameters automatically using `inner_cv`
    regressor_that_optimizes_its_hyperparams = GridSearchCV(
        estimator=model, param_grid=params,
        cv=inner_cv, scoring='neg_mean_squared_error')

    # estimate generalization error on the 3-fold splits of the data
    scores_across_outer_folds = cross_val_score(
        regressor_that_optimizes_its_hyperparams,
        X, y, cv=outer_cv, scoring='neg_mean_squared_error')

    # get the mean MSE across each of outer_cv's 3 folds
    average_scores_across_outer_folds_for_each_model[name] = np.mean(scores_across_outer_folds)
    error_summary = 'Model: {name}\nMSE in the 3 outer folds: {scores}.\nAverage error: {avg}'
    print(error_summary.format(
        name=name, scores=scores_across_outer_folds,
        avg=np.mean(scores_across_outer_folds)))
    print()

print('Average score across the outer folds: ',
      average_scores_across_outer_folds_for_each_model)

many_stars = '\n' + '*' * 100 + '\n'
print(many_stars + 'Now we choose the best model and refit on the whole dataset' + many_stars)

best_model_name, best_model_avg_score = max(
    average_scores_across_outer_folds_for_each_model.items(),
    key=(lambda name_averagescore: name_averagescore[1]))

# get the best model and its associated parameter grid
best_model, best_model_params = models_and_parameters[best_model_name]

# now we refit this best model on the whole dataset so that we can start
# making predictions on other data, and now we have a reliable estimate of
# this model's generalization error and we are confident this is the best model
# among the ones we have tried
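# use the same scoring as in the nested loop so the final search matches it
final_regressor = GridSearchCV(best_model, best_model_params, cv=inner_cv,
                               scoring='neg_mean_squared_error')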
final_regressor.fit(X, y)

print('Best model: \n\t{}'.format(best_model), end='\n\n')
print('Estimation of its generalization error (negative mean squared error):\n\t{}'.format(
    best_model_avg_score), end='\n\n')
print('Best parameter choice for this model: \n\t{params}'
      '\n(according to cross-validation `{cv}` on the whole dataset).'.format(
      params=final_regressor.best_params_, cv=inner_cv))

— Charlie Brummitt

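In your last comment you say "...refit this best model on the whole training set", but you actually do it on the whole dataset (X and y). As far as I understand this is the right thing to do, but then the comment should be corrected. What do you think?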

— Dror Atariah 2017

Thanks @DrorAtariah for catching that. You're right. I fixed it.

— Charlie Brummitt 2017

1

You do not need

# this performs a nested CV in SKLearn
score = cross_val_score(clf, X_train, y_train, scoring='mean_squared_error')

GridSearchCV does this for you. To get an intuition for the grid-search process, try using GridSearchCV(... , verbose=3).

To extract the scores for each fold, see this example in the scikit-learn documentation.

— lanenok

I thought grid search was only for optimizing hyperparameters? How would I use grid search together with something else to find the best algorithm (i.e. SVR vs. RandomForest)?

— BobbyJohnsonOG

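Yes. For each combination of hyperparameters, GridSearchCV creates folds and computes scores (mean squared error in your case) on the left-out data. Each combination of hyperparameters therefore gets its own mean score. The "optimization" is just choosing the combination with the best mean score. You can extract these mean scores and compare them directly across models.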
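
A minimal sketch of extracting those mean scores, assuming a recent scikit-learn where they are stored in `cv_results_['mean_test_score']`; the data and grids below are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic data used only for illustration
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

searches = {
    'svr': GridSearchCV(SVR(), {'C': [0.1, 1, 10]},
                        cv=3, scoring='neg_mean_squared_error'),
    'rf': GridSearchCV(RandomForestRegressor(n_estimators=20, random_state=0),
                       {'max_depth': [2, 5, None]},
                       cv=3, scoring='neg_mean_squared_error'),
}

best_mean_mse = {}
for name, search in searches.items():
    search.fit(X, y)
    results = search.cv_results_
    # one mean score per hyperparameter combination (stored as negated MSE)
    for p, m in zip(results['params'], results['mean_test_score']):
        print(name, p, -m)
    # per-fold scores are under split0_test_score, split1_test_score, ...
    best_mean_mse[name] = -results['mean_test_score'].max()

print(best_mean_mse)
```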

— lanenok