Train/test/validation set splitting in Sklearn

59

How can I randomly split a data matrix and the corresponding label vector into X_train, X_test, X_val, y_train, y_test, y_val with Sklearn? As far as I know, sklearn.cross_validation.train_test_split can only split into two sets, not three...

machine-learning scikit-learn

— Hendrik

81

You can simply use sklearn.model_selection.train_test_split twice: first to split the data into train and test, then to split train again into train and validation. Something like this:

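from sklearn.model_selection import train_test_split

# First split: hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Second split: hold out 20% of the remaining training data for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=1)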

— hh32

1

Yes, this works of course, but I was hoping for something more elegant ;) Never mind, I accept this answer.

— Hendrik

1

If you want to use the validation set to search for the best hyperparameters, you can do the following after the split: gist.github.com/albertotb/1bad123363b186267e3aeaa26610b54b

— skd

12

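So what are the final train, test and validation proportions in this example? The second train_test_split is applied to the 80% left over from the first 80/20 split, so val is 20% of 80%. The split proportions are not very straightforward this way.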

— Monica Heddneck

1

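I agree with @Monica Heddneck: the resulting 64% train, 16% validation, 20% test split could be stated more clearly. With this solution you have to work the proportions out yourself, which is annoying.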
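
To make that arithmetic explicit, here is a minimal sketch. The helper name `effective_fractions` is hypothetical (it is not part of scikit-learn); it only computes the overall fractions that two chained splits produce.

```python
def effective_fractions(first_test_size, second_test_size):
    """Overall (train, val, test) fractions after two chained splits.

    first_test_size:  fraction held out as the test set in the first split
    second_test_size: fraction of the *remaining* data held out for validation
    """
    test = first_test_size
    remaining = 1.0 - first_test_size
    val = remaining * second_test_size
    train = remaining - val
    return train, val, test


# test_size=0.2 in both calls, as in the accepted answer
train, val, test = effective_fractions(0.2, 0.2)
print(f"train={train:.2f}, val={val:.2f}, test={test:.2f}")
# train=0.64, val=0.16, test=0.20
```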

— Perry

32

There is a great answer to this question on SO that uses numpy and pandas.

The command (see the answer for the discussion):

train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

This produces a 60%, 20%, 20% split for the training, validation and test sets.

— 0_0

2

I can see that .6 means 60%... but what does .8 mean?

— Tom Hale

1

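@TomHale np.split splits the shuffled array at 60% of its length and then at 80% of its length (an additional 20% of the data), which leaves the remaining 20% as the third part. This follows from how the function is defined: the list gives the indices at which to cut. You can test/play with: x = np.arange(10.0), followed by np.split(x, [int(len(x)*0.6), int(len(x)*0.8)])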
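
Spelled out as a runnable snippet, the experiment suggested above looks roughly like this (the variable names `cuts`, `first`, `second` and `third` are only illustrative):

```python
import numpy as np

x = np.arange(10.0)  # ten elements: 0.0 ... 9.0

# Cut points at 60% and 80% of the length, i.e. at indices 6 and 8
cuts = [int(len(x) * 0.6), int(len(x) * 0.8)]
first, second, third = np.split(x, cuts)

print(len(first), len(second), len(third))  # 6 2 2
```

The list passed to np.split holds cut positions, not part sizes, which is why .8 marks the end of the second part rather than a share of 80%.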

— 0_0

3

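Most often you will not split the data just once. In a first step you split your data into a training set and a test set. Subsequently you perform a parameter search that incorporates more complex splitting, such as cross-validation with a k-fold or leave-one-out (LOO) scheme.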
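
As a rough sketch of that workflow, the example below holds out a test set and then runs a 5-fold cross-validated grid search on the training portion only. The dataset (iris), the classifier (k-nearest neighbours) and the parameter grid are assumptions chosen purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Step 1: hold out a test set that the parameter search never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Step 2: search over n_neighbors with 5-fold cross-validation on the training data
cv = KFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=cv)
search.fit(X_train, y_train)

print(search.best_params_)
# Accuracy of the refit best model on the untouched test set
print(search.score(X_test, y_test))
```

Here the folds play the role of the validation set, so no separate X_val is needed.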

— JLT

3

You can use train_test_split twice. I think this is the most straightforward way.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)

This way, the train, val and test sets will be 60%, 20% and 20% of the dataset, respectively.
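
A quick way to check these proportions is to run the two calls on dummy data and count the rows. The sketch below assumes 100 samples with 2 features; the arrays are placeholders, not real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 dummy samples, 2 features
y = np.arange(100)                  # dummy labels, one per sample

# 20% of 100 rows go to the test set, 80 rows remain
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
# 25% of the remaining 80 rows (20 rows) go to the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```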

— David Jung

2

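The best answer above does not mention that calling train_test_split twice without adjusting the partition sizes will not give the initially intended partition. Start by splitting off everything that is not training data: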

x_train, x_remain = train_test_split(x, test_size=(val_size + test_size))

The validation and test portions must then be rescaled relative to x_remain, and can be computed as follows:

new_test_size = np.around(test_size / (val_size + test_size), 2)
# To preserve (new_test_size + new_val_size) = 1.0 
new_val_size = 1.0 - new_test_size

x_val, x_test = train_test_split(x_remain, test_size=new_test_size)

This way, all of the initially intended partition sizes are preserved.

— Ametov

1

Here is another approach (it assumes an equal three-way split).

# randomly shuffle the dataframe
df = df.reindex(np.random.permutation(df.index))

# how many records is one-third of the entire dataframe
third = int(len(df) / 3)

# Training set (the top third from the entire dataframe)
train = df[:third]

# Testing set (top half of the remainder two third of the dataframe)
test = df[third:][:third]

# Validation set (bottom one third)
valid = df[-third:]

This can be made more concise, but I kept it verbose for explanation purposes.
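
One possible concise variant, sketched under the assumption that splitting shuffled row positions is acceptable, uses np.array_split. Unlike slicing with a fixed `third`, it assigns every row to one of the parts even when the row count is not divisible by three. The names `n_rows`, `train_idx`, `test_idx` and `valid_idx` are illustrative.

```python
import numpy as np

n_rows = 10  # stand-in for len(df)
rng = np.random.default_rng(0)

# Shuffle the row positions, then cut them into three nearly equal parts
shuffled = rng.permutation(n_rows)
train_idx, test_idx, valid_idx = np.array_split(shuffled, 3)

print(len(train_idx), len(test_idx), len(valid_idx))  # 4 3 3
# With a DataFrame, the rows would then be selected as df.iloc[train_idx], etc.
```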

— Vishal

0

Given train_frac=0.8, this function creates an 80% / 10% / 10% split:

from sklearn.model_selection import train_test_split

def data_split(examples, labels, train_frac, random_state=None):
    ''' https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
    param examples:   Data to be split
    param labels:     Labels corresponding to the data
    param train_frac: Ratio of train set to whole dataset

    Randomly split dataset, based on these ratios:
        'train': train_frac
        'valid': (1-train_frac) / 2
        'test':  (1-train_frac) / 2

    Eg: passing train_frac=0.8 gives a 80% / 10% / 10% split
    '''

    # train_test_split only accepts float sizes strictly between 0 and 1
    assert 0 < train_frac < 1, "Invalid training set fraction"

    X_train, X_tmp, Y_train, Y_tmp = train_test_split(
        examples, labels, train_size=train_frac, random_state=random_state)

    # Split the held-out part in half: one half validation, one half test
    X_val, X_test, Y_val, Y_test = train_test_split(
        X_tmp, Y_tmp, train_size=0.5, random_state=random_state)

    return X_train, X_val, X_test, Y_train, Y_val, Y_test

— Tom Hale

0

Adding to @hh32's answer, while respecting predefined proportions such as (75, 15, 10):

train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10

# train is now 75% of the entire data set
# x_test and y_test temporarily hold the remaining 25%
x_train, x_test, y_train, y_test = train_test_split(dataX, dataY, test_size=1 - train_ratio)

# test is now 10% of the initial data set
# validation is now 15% of the initial data set
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio)) 

print(x_train, x_val, x_test)

— Andrei Florea

0

An extension of @hh32's answer that preserves the intended ratios.

# Defines ratios, w.r.t. whole dataset.
ratio_train = 0.8
ratio_val = 0.1
ratio_test = 0.1

# Produces test split.
x_remaining, x_test, y_remaining, y_test = train_test_split(
    x, y, test_size=ratio_test)

# Adjusts val ratio, w.r.t. remaining dataset.
ratio_remaining = 1 - ratio_test
ratio_val_adjusted = ratio_val / ratio_remaining

# Produces train and val splits.
x_train, x_val, y_train, y_val = train_test_split(
    x_remaining, y_remaining, test_size=ratio_val_adjusted)

Since the remaining dataset is smaller after the first split, the new ratio with respect to the reduced dataset must be calculated by solving the equation:

$R_{remaining} \cdot R_{new} = R_{old}$
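
As a worked instance of this equation, the sketch below solves for the new ratio with the 80/10/10 values used above. The helper name `adjusted_ratio` is hypothetical and exists only for illustration.

```python
def adjusted_ratio(ratio_old, ratio_removed):
    """Solve R_remaining * R_new = R_old for R_new.

    ratio_old:     desired fraction of the whole dataset (e.g. validation)
    ratio_removed: fraction of the whole dataset already split off (e.g. test)
    """
    ratio_remaining = 1.0 - ratio_removed
    return ratio_old / ratio_remaining


# 10% validation of the whole set, after 10% was already taken for testing
ratio_new = adjusted_ratio(0.1, 0.1)
print(round(ratio_new, 4))  # 0.1111
```

So the second call needs test_size of about 0.111 rather than 0.1 to end up with 10% of the full dataset for validation.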

— Jorge Barrios