How to split/partition a dataset into training and test datasets for, e.g., cross validation?

99

What is a good way to randomly split a NumPy array into training and testing/validation datasets? Something similar to the cvpartition or crossvalind functions in Matlab.

— エリック

125

If you want to split the dataset once into two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices:

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]

or

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]

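There are many ways to repeatedly partition the same dataset for cross validation. One strategy is to resample from the dataset with replacement, so the same row can be drawn more than once and can land in both sets: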

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx,:], x[test_idx,:]

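Finally, sklearn contains several cross validation methods (k-fold, leave-n-out, ...). It also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that the training and test sets contain the same proportion of positive and negative examples.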
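As a rough illustration of the k-fold option mentioned above, here is a minimal sketch using sklearn's KFold. The dataset x, the choice of 5 folds, and the seed are assumptions made for the example, not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import KFold

x = np.random.rand(100, 5)  # stand-in dataset, same shape as in the snippets above

# 5 folds; shuffle=True randomizes which rows land in which fold
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
for train_idx, test_idx in kf.split(x):
    training, test = x[train_idx], x[test_idx]
    fold_sizes.append((len(training), len(test)))

print(fold_sizes)
```

Each row ends up in the test portion of exactly one fold, so the five test sets together cover the whole dataset.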

— Pberkes

13

Thanks for these solutions. But doesn't the last method, using randint, have a good chance of giving the same indices for both the test and training sets?

— ggauravr 2013

3

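The second solution is a valid answer, while the first and third ones are not. For the first solution, shuffling the dataset is not always an option; there are many cases where you have to keep the order of the data inputs. And the third one could very well produce the same indices for test and training (as pointed out by @ggauravr).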
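The overlap described in these comments can be checked directly. The sketch below is only an illustration: the seed, the sizes, and the variable names are assumptions. It also shows one way to address the ordering concern, by sorting the index arrays so the selected rows stay in their original order.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
n = 100

# Sampling with replacement (what randint does): indices can repeat and overlap
train_wr = rng.integers(n, size=80)
test_wr = rng.integers(n, size=20)
shared_wr = np.intersect1d(train_wr, test_wr)

# A single permutation cut in two: every index is used exactly once
perm = rng.permutation(n)
train_p, test_p = perm[:80], perm[80:]
shared_p = np.intersect1d(train_p, test_p)

# Sorting the index arrays keeps the selected rows in their original order
train_sorted, test_sorted = np.sort(train_p), np.sort(test_p)

print("unique training rows with replacement:", np.unique(train_wr).size)
print("rows shared by train/test with replacement:", shared_wr.size)
print("rows shared by train/test with permutation:", shared_p.size)
```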

— pedram bashiri

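You should not resample your cross validation sets. The whole idea is that the CV set has never been seen by your algorithm before. The training and test sets are used to fit the data, so of course you will get good results if you include them in your CV set. I want to upvote this answer because the second solution is what I needed, but this answer has problems.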

— RubberDuck

55

There is another option that just entails using scikit-learn. As scikit's wiki describes, you can simply use the following instructions:

import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)

This way you can keep the labels in sync with the data you are trying to split into training and test.
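To make "in sync" concrete, here is a NumPy-only sketch of the same idea: a single shared permutation indexes both arrays, so every row keeps its label. This is not how train_test_split is implemented internally; the seed and variable names are assumptions for the example.

```python
import numpy as np

data = np.arange(10).reshape((5, 2))
labels = np.arange(5)

rng = np.random.default_rng(42)
order = rng.permutation(len(data))      # one shared shuffle for both arrays
n_test = int(round(0.20 * len(data)))   # 1 of 5 rows

test_idx, train_idx = order[:n_test], order[n_test:]
data_train, data_test = data[train_idx], data[test_idx]
labels_train, labels_test = labels[train_idx], labels[test_idx]

# Row i of `data` is [2*i, 2*i + 1], so each label can be checked against its row
in_sync = np.array_equal(data_train[:, 0], 2 * labels_train)
print(in_sync)
```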

— パウロ・マルバー

1

This is a very practical answer, because it realistically handles both the training set and the labels.

— chinnychinchin

38

Just a note. In case you want train, test, AND validation sets, you can do this:

# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split

X = get_my_X()
y = get_my_y()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

These parameters will give 70% to training, and 15% each to the test and val sets. Hope this helps.
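The 70/15/15 result follows from chaining the two fractions: 0.3 of the data is held out first, and half of that (0.3 × 0.5 = 0.15) goes to each of the two smaller sets. For other target proportions, the two test_size values could be computed with a small helper like the one sketched below; two_stage_test_sizes is a hypothetical name, not part of sklearn.

```python
def two_stage_test_sizes(train, val, test):
    """Return the test_size values for two chained splits (hypothetical helper)."""
    total = train + val + test
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    first = val + test      # share held out from the training data in split 1
    second = val / first    # share of that held-out part given to validation in split 2
    return first, second

first, second = two_stage_test_sizes(0.70, 0.15, 0.15)
print(first, second)

# Another target: 60% train, 30% validation, 10% test
first_b, second_b = two_stage_test_sizes(0.60, 0.30, 0.10)
print(first_b, second_b)
```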

— オフホワイト

5

You should probably add this to your code: from sklearn.cross_validation import train_test_split, to make it clear which module you are using.

— Radix

Does this have to be random?

— liang

That is, is it possible to split according to the given order of X and Y?

— liang 2017

1

@liang No, it doesn't have to be random. You could just say that the train, test, and validation set sizes will be a, b, and c percent of the size of the whole dataset. Let's say a=0.7, b=0.15, c=0.15, d = dataset, and N=len(dataset); then x_train = dataset[0:int(a*N)], x_test = dataset[int(a*N):int((a+b)*N)], and x_val = dataset[int((a+b)*N):].
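The slicing described in this comment could be wrapped up as follows. This is a minimal sketch: ordered_split is a hypothetical name, and the list of 100 integers is a stand-in dataset.

```python
def ordered_split(dataset, a=0.7, b=0.15):
    """Split a sequence into train/test/val slices without shuffling."""
    n = len(dataset)
    train_end = int(a * n)
    test_end = int((a + b) * n)
    return dataset[:train_end], dataset[train_end:test_end], dataset[test_end:]

dataset = list(range(100))
x_train, x_test, x_val = ordered_split(dataset)
print(len(x_train), len(x_test), len(x_val))
```

Because nothing is shuffled, the three pieces are consecutive blocks of the original data, which is what you want when the input order has to be preserved.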

— offwhitelotus 2017

1

Deprecated: stackoverflow.com/a/34844352/4237080; use from sklearn.model_selection import train_test_split

— briennakh

14

The sklearn.cross_validation module is deprecated (and was removed in scikit-learn 0.20). You can use this instead:

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)

X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)

— マシェイ

5

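You may also want to consider a stratified split into training and test sets. A stratified split also generates the training and test sets randomly, but in such a way that the original class proportions are preserved. This makes the training and test sets better reflect the properties of the original dataset.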

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

y = np.array([1,1,2,2,3,3])
train_inds,test_inds = get_train_test_inds(y,train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])

This code outputs:

[1 2 3]
[1 2 3]
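For comparison, sklearn's train_test_split accepts a stratify argument that aims for the same effect. The sketch below is an illustration, not the answer author's method; the imbalanced labels and the seed are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape((20, 2))
y = np.array([0] * 15 + [1] * 5)   # imbalanced labels: 75% / 25%

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Class counts in each part; the 3:1 ratio should carry over to both
print(np.bincount(y_train), np.bincount(y_test))
```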

— Apogentus

Thank you! The naming is somewhat misleading: value_inds really are indices, but the outputs are not indices, only masks.
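If integer indices are needed rather than boolean masks, np.flatnonzero converts one into the other. A small sketch with a hand-written example mask (an assumption for illustration):

```python
import numpy as np

y = np.array([1, 1, 2, 2, 3, 3])
train_mask = np.array([True, False, True, False, True, False])  # example mask

train_idx = np.flatnonzero(train_mask)    # positions where the mask is True
test_idx = np.flatnonzero(~train_mask)    # the complementary positions

print(train_idx, test_idx)
```

Indexing with either form selects the same elements: y[train_mask] and y[train_idx] are equal.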

— greenoldman 2017

1

I wrote a function for my own project to do this (it doesn't use numpy, though):

def partition(seq, chunks):
    """Splits the sequence into (nearly) equal sized chunks and returns them as a list"""
    result = []
    for i in range(chunks):
        chunk = []
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result

If you want the chunks to be randomized, just shuffle the list before passing it in.
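One way the chunks could be turned into cross validation folds is sketched below. The partition function is restated in compact form so the example runs on its own; the sample data, the seed, and the splits variable are assumptions for illustration.

```python
import random

def partition(seq, chunks):
    """Same strided split as above, restated so this sketch runs on its own."""
    return [list(seq[i::chunks]) for i in range(chunks)]

samples = list(range(10))
random.Random(0).shuffle(samples)   # shuffle first to randomize the chunks

folds = partition(samples, 5)
splits = []
for k, test in enumerate(folds):
    # training set: every fold except the k-th
    train = [s for j, fold in enumerate(folds) if j != k for s in fold]
    splits.append((train, test))

for train, test in splits:
    print("TRAIN:", train, "TEST:", test)
```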

— コリン

0

Here is code to split the data into n=5 folds in a stratified manner:

# X = data array
# y = class labels
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

— プラシャント

0

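Thanks for the answer. I modified it to avoid (1) replacement while sampling and (2) duplicated instances occurring in both training and testing:

# Draw 80% of the row indices without replacement
training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
# Equivalent alternative: take the first 80% of a random permutation
training_idx = np.random.permutation(np.arange(X.shape[0]))[:int(np.round(X.shape[0] * 0.8))]
# Everything not selected for training becomes the test set
test_idx = np.setdiff1d(np.arange(0, X.shape[0]), training_idx)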

— ザーラン

0

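After doing some reading and taking into account the (many..) different ways of splitting the data to train and test, I decided to time them!

I used 4 different methods (none of them uses the sklearn library, which I'm sure would give the best results, given that it is well designed and tested code):

1. Shuffle the whole matrix arr and then split the data into train and test
2. Shuffle the indices and then assign them to x and y to split the data
3. Same as method 2, but done in a more efficient way
4. Use a pandas dataframe to split

Method 3 won by far with the shortest time, followed by method 1; methods 2 and 4 turned out to be really inefficient.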

The code for the 4 different methods I timed:

import numpy as np
arr = np.random.rand(100, 3)
X = arr[:,:2]
Y = arr[:,2]
spl = 0.7
N = len(arr)
sample = int(spl*N)

#%% Method 1:  shuffle the whole matrix arr and then split
np.random.shuffle(arr)
x_train, x_test, y_train, y_test = X[:sample,:], X[sample:, :], Y[:sample, ], Y[sample:,]

#%% Method 2: shuffle the indices and apply them to X and Y
train_idx = np.random.choice(N, sample, replace=False)  # replace=False: no repeated rows
Xtrain = X[train_idx]
Ytrain = Y[train_idx]

test_idx = [idx for idx in range(N) if idx not in train_idx]
Xtest = X[test_idx]
Ytest = Y[test_idx]

#%% Method 3: shuffle indices without a for loop
idx = np.random.permutation(arr.shape[0])  # can also use random.shuffle
train_idx, test_idx = idx[:sample], idx[sample:]
x_train, x_test, y_train, y_test = X[train_idx,:], X[test_idx,:], Y[train_idx,], Y[test_idx,]

#%% Method 4: using pandas dataframe to split
import pandas as pd
df = pd.read_csv(file_path, header=None) # Some csv file (I used some file with 3 columns)

train = df.sample(frac=0.7, random_state=200)
test = df.drop(train.index)

And here are the times; the minimum out of 3 repetitions of 1000 loops is:

Method 1: 0.35883826200006297 seconds
Method 2: 1.7157016959999964 seconds
Method 3: 1.7876616719995582 seconds
Method 4: 0.07562861499991413 seconds

I hope that's helpful!
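The answer does not show its timing harness. One plausible way to take "the minimum out of 3 repetitions of 1000 loops" is timeit.repeat, sketched here for the permutation-based method only. The function name split_by_permutation is an assumption, and the measured times will differ from machine to machine.

```python
import timeit
import numpy as np

arr = np.random.rand(100, 3)
X, Y = arr[:, :2], arr[:, 2]
sample = int(0.7 * len(arr))

def split_by_permutation():
    """Permutation-based split, wrapped in a function so timeit can call it."""
    idx = np.random.permutation(arr.shape[0])
    train_idx, test_idx = idx[:sample], idx[sample:]
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

# 3 repetitions of 1000 calls each; report the fastest repetition
times = timeit.repeat(split_by_permutation, number=1000, repeat=3)
best = min(times)
print(f"best of 3: {best:.4f} s per 1000 calls")
```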

— 回転

0

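You probably need not only to split into train and test, but also to cross-validate to make sure your model generalizes. Here I am assuming 70% training data, 20% validation and 10% holdout/test data.

Check out np.split:

If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split. For example, [2, 3] would, for axis=0, result in

ary[:2], ary[2:3], ary[3:]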

t, v, h = np.split(df.sample(frac=1, random_state=1), [int(0.7*len(df)), int(0.9*len(df))])
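The same cut points can be tried on a plain NumPy array. In this sketch a shuffled array of 100 integers stands in for the rows of df (an assumption; the line above operates on a pandas DataFrame).

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.permutation(np.arange(100))   # shuffled stand-in for the rows of df

n = len(data)
# cut points at 70% and 90% give 70% / 20% / 10% pieces
t, v, h = np.split(data, [int(0.7 * n), int(0.9 * n)])
print(len(t), len(v), len(h))
```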

— B.Mr.W.

0

Split into train, test, and valid

import numpy as np

x = np.expand_dims(np.arange(100), -1)

print(x)

indices = np.random.permutation(x.shape[0])

# 90% train, 5% test, 5% validation, taken from disjoint slices of the permutation
training_idx, test_idx, val_idx = indices[:int(x.shape[0]*.9)], indices[int(x.shape[0]*.9):int(x.shape[0]*.95)], indices[int(x.shape[0]*.95):]

training, test, val = x[training_idx,:], x[test_idx,:], x[val_idx,:]

print(training, test, val)

— ラジャト・スブラ・ボウミック