Using scikit-learn to classify into multiple categories

Question 1

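I'm trying to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of every algorithm I have tried returns only one match.

For example, I have a piece of text: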

"Theaters in New York compared to those in London"

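And I've trained the algorithm to pick a place for every text snippet I feed it.

In the example above I would want it to return New York and London, but it only returns New York.

Is it possible to use scikit-learn to return multiple results? Or even to return the label with the next highest probability?

Thanks for your help.

Update:

I tried using OneVsRestClassifier, but I still get only one option back per piece of text. Below is the sample code I am using: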

y_train = ('New York','London')


train_set = ("new york nyc big apple", "london uk great britain")
vocab = {'new york' :0,'nyc':1,'big apple':2,'london' : 3, 'uk': 4, 'great britain' : 5}
count = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=2),vocabulary=vocab)
test_set = ('nice day in nyc','london town','hello welcome to the big apple. enjoy it here and london too')

X_vectorized = count.transform(train_set).todense()
smatrix2  = count.transform(test_set).todense()


base_clf = MultinomialNB(alpha=1)

clf = OneVsRestClassifier(base_clf).fit(X_vectorized, y_train)
Y_pred = clf.predict(smatrix2)
print Y_pred

Result: ['New York' 'London' 'London']

Question 2

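What you want is called multi-label classification. Scikit-learn can do that. See here: http://scikit-learn.org/dev/modules/multiclass.html.

I'm not sure what is going wrong in your example; my version of sklearn apparently doesn't have WordNGramAnalyzer. Perhaps it is a matter of using more training examples or trying a different classifier? Note, though, that the multi-label classifier expects the target to be a list of tuples or a list of lists of labels, one per sample.

The following works for me: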

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train = [[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1],[1],[0,1],[0,1]]
X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'hello welcome to new york. enjoy it here and london too'])   
target_names = ['New York', 'London']

classifier = Pipeline([
    ('vectorizer', CountVectorizer(min_n=1,max_n=2)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
for item, labels in zip(X_test, predicted):
    print '%s => %s' % (item, ', '.join(target_names[x] for x in labels))

For me, this produces the output:

nice day in nyc => New York
welcome to london => London
hello welcome to new york. enjoy it here and london too => New York, London

Hope this helps.

Question 3

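EDIT: Updated for Python 3 and scikit-learn 0.18.1, using MultiLabelBinarizer as suggested.

I've been working on this as well, and made a slight enhancement to mwv's excellent answer. It takes text labels as input rather than binary labels and encodes them using MultiLabelBinarizer.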

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","london"],["new york","london"]]

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = mlb.inverse_transform(predicted)

for item, labels in zip(X_test, all_labels):
    print('{0} => {1}'.format(item, ', '.join(labels)))

This gives me the following output:

nice day in nyc => new york
welcome to london => london
london is rainy => london
it is raining in britian => london
it is raining in britian and the big apple => new york
it is raining in britian and nyc => london, new york
hello welcome to new york. enjoy it here and london too => london, new york
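On the second part of the original question (getting the next most likely label), one option is to rank labels by a per-label score instead of calling predict. The sketch below is an assumption on my part, not part of the answer above: it swaps LinearSVC for LogisticRegression so that predict_proba is available, and reuses a shortened version of the same training data. In the multi-label case the per-label probabilities are independent of each other, so they do not have to sum to 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

X_train = ["new york is a hell of a town",
           "the big apple is great",
           "nyc is nice",
           "london is in the uk",
           "it rains a lot in london",
           "london hosts the british museum",
           "new york is great and so is london",
           "i like london better than new york"]
y_train_text = [["new york"], ["new york"], ["new york"],
                ["london"], ["london"], ["london"],
                ["new york", "london"], ["new york", "london"]]

# Encode the label lists as a 0/1 indicator matrix (one column per label).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

# LogisticRegression is used here only because it exposes predict_proba.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])
model.fit(X_train, Y)

X_test = ["nice day in nyc",
          "welcome to london",
          "hello welcome to new york. enjoy it here and london too"]
proba = model.predict_proba(X_test)  # shape: (n_samples, n_labels)

for text, row in zip(X_test, proba):
    order = np.argsort(row)[::-1]  # label indices, highest score first
    ranked = [(mlb.classes_[i], round(float(row[i]), 3)) for i in order]
    print("{0} => {1}".format(text, ranked))
```

With this you can take the top k labels, or apply your own threshold, rather than relying on the default decision made by predict.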

Question 4

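I ran into this as well. The problem for me was that my y_train was a sequence of strings, rather than a sequence of sequences of strings. Apparently, OneVsRestClassifier decides whether to do multi-class or multi-label classification based on the format of the input labels. So change: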

y_train = ('New York','London')

to

y_train = (['New York'],['London'])

Apparently this behavior will go away in the future, since it breaks when all the labels are the same: https://github.com/scikit-learn/scikit-learn/pull/1987
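To see how the label format is interpreted, you can ask scikit-learn directly. This is a minimal sketch, assuming a reasonably recent scikit-learn in which multi-label targets are passed as a 0/1 indicator matrix (as far as I know, a plain list of label lists is no longer accepted directly by newer releases). It uses type_of_target from sklearn.utils.multiclass; the variable names are just for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.utils.multiclass import type_of_target

# One label per sample: treated as a single-label problem.
y_flat = ["New York", "London", "New York"]

# One list of labels per sample, encoded as a 0/1 indicator matrix.
y_nested = [["New York"], ["London"], ["New York", "London"]]
Y_indicator = MultiLabelBinarizer().fit_transform(y_nested)

flat_kind = type_of_target(y_flat)
indicator_kind = type_of_target(Y_indicator)

print(flat_kind)       # a single-label target type
print(indicator_kind)  # multilabel-indicator
```

Only the second form makes OneVsRestClassifier return more than one label per sample.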

Question 5

Change this line to make it work in newer versions of Python:

# lb = preprocessing.LabelBinarizer()
lb = preprocessing.MultiLabelBinarizer()
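For context, here is a small sketch of what MultiLabelBinarizer does with lists of labels, assuming the same "new york"/"london" labels used elsewhere in this thread. Each sample gets one row, each distinct label gets one column (in sorted order), and inverse_transform maps rows back to label tuples.

```python
from sklearn.preprocessing import MultiLabelBinarizer

labels_per_sample = [["new york"], ["london"], ["new york", "london"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels_per_sample)

print(mlb.classes_)  # ['london' 'new york']
print(Y)
# [[0 1]
#  [1 0]
#  [1 1]]

# Map the indicator rows back to tuples of label names.
decoded = mlb.inverse_transform(Y)
print(decoded)  # [('new york',), ('london',), ('london', 'new york')]
```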

Question 6

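A few multi-class examples are as follows:

Example 1:

import numpy as np
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

# 1-D array of class labels; the value 1 appears twice
labels = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1])
# One row per sample, one column per distinct class
transformed_labels = encoder.fit_transform(labels)
print(transformed_labels)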

The output will be:

[[1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0]]

Example 2:

import numpy as np
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

# String labels work too; columns follow the sorted class order
labels = np.array(['Leopard', 'Lion', 'Tiger', 'Lion'])
transformed_labels = encoder.fit_transform(labels)
print(transformed_labels)

The output will be:

[[1 0 0]
 [0 1 0]
 [0 0 1]
 [0 1 0]]
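One thing to watch for with LabelBinarizer: when there are only two distinct classes, it produces a single column rather than two. The sketch below illustrates this with made-up labels in the style of Example 2, and also shows inverse_transform recovering the original labels.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()

# Only two distinct classes: the result has a single 0/1 column.
two_class = encoder.fit_transform(np.array(["Lion", "Tiger", "Lion"]))
print(two_class.shape)  # (3, 1)
print(two_class)
# [[0]
#  [1]
#  [0]]

# Decode the column back to the original string labels.
restored = encoder.inverse_transform(two_class)
print(restored)  # ['Lion' 'Tiger' 'Lion']
```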