How can I initialize a new word2vec model with the weights of a pre-trained model?

13

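I am using Python's Gensim library to train and use word2vec models. Recently I have been looking into initializing my model's weights from a pre-trained word2vec model, such as the GoogleNewDataset pre-trained model. I have been struggling with this for a few weeks. I have now found that gensim has a function that may help initialize a model's weights from a pre-trained model. It is described as follows: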

reset_from(other_model)

    Borrow shareable pre-built structures (like vocab) from the other_model. Useful if testing multiple models in parallel on the same corpus.

I am not sure whether this function does what I need. Please help!

— Nomiluks

Do the two models share the same vocabulary?

— Hima Varsha

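Why not start each word2vec parameter from randomly generated numbers on every run? I was able to do this, and by carefully choosing the random values for each parameter (numFeatures, contextWindow, seed) I obtained the random similarity tuples my use case needed, simulating an ensemble architecture. What do others think? Please reply.

— Zoze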

18

Thanks, Abhishek, I figured it out! Here are my experiments.

1) Plot a simple example:

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
            ['this', 'is', 'the', 'second', 'sentence'],
            ['yet', 'another', 'sentence'],
            ['one', 'more', 'sentence'],
            ['and', 'the', 'final', 'sentence']]
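# train model (gensim 3.x API; gensim 4.x renames `size` to `vector_size`)
model_1 = Word2Vec(sentences, size=300, min_count=1)

# fit a 2d PCA model to the vectors
# (index through `.wv`; indexing the model object directly is deprecated)
X = model_1.wv[model_1.wv.vocab]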
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model_1.wv.vocab)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

The plot above shows that, with such simple sentences, the distances between vectors cannot separate words by meaning.

2) Load pre-trained word embeddings:

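from gensim.models import KeyedVectors

# start from the vocabulary of our own sentences
model_2 = Word2Vec(size=300, min_count=1)
model_2.build_vocab(sentences)
total_examples = model_2.corpus_count
# load the pre-trained vectors; the file must be in word2vec text format,
# so a raw GloVe file needs converting first (e.g. gensim's glove2word2vec script)
model = KeyedVectors.load_word2vec_format("glove.6B.300d.txt", binary=False)
# add the pre-trained words to the vocabulary, then copy in their vectors;
# lockf=1.0 lets the imported vectors keep training
model_2.build_vocab([list(model.vocab.keys())], update=True)
model_2.intersect_word2vec_format("glove.6B.300d.txt", binary=False, lockf=1.0)
# `epochs` replaces the deprecated `iter` attribute
model_2.train(sentences, total_examples=total_examples, epochs=model_2.epochs)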

# fit a 2d PCA model to the vectors
X = model_2.wv[model_1.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model_1.wv.vocab)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

The plot above shows that the word embeddings are now more meaningful.
I hope this answer helps.
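Conceptually, the intersect step above overlays pre-trained vectors onto a freshly initialized embedding matrix for the words both vocabularies share, and leaves every other word at its random initial value. A minimal NumPy sketch of that idea follows. It is not gensim's implementation: the helper `init_from_pretrained`, the toy vocabulary, and the 4-dimensional vectors are made up for illustration.

```python
import numpy as np

def init_from_pretrained(vocab, pretrained, dim, seed=0):
    """Hypothetical helper: build an embedding matrix for `vocab`,
    copying rows from `pretrained` (dict: word -> vector) where available."""
    rng = np.random.default_rng(seed)
    # small random values for every word, used when no pre-trained vector exists
    matrix = (rng.random((len(vocab), dim)) - 0.5) / dim
    hits = []
    for row, word in enumerate(vocab):
        vec = pretrained.get(word)
        if vec is not None and len(vec) == dim:
            matrix[row] = vec          # overwrite with the pre-trained vector
            hits.append(word)
    return matrix.astype(np.float32), hits

# toy data, assumed for illustration only
vocab = ["this", "is", "sentence", "word2vec"]
pretrained = {
    "this": np.array([0.1, 0.2, 0.3, 0.4]),
    "sentence": np.array([0.5, 0.6, 0.7, 0.8]),
}

weights, found = init_from_pretrained(vocab, pretrained, dim=4)
print(found)
```

Words missing from the pre-trained set ("is" and "word2vec" here) keep their random rows, which is one reason the choice of pre-trained vocabulary matters for small corpora.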

— Shixiang Wang

1

This answer is very informative and helped me embed a model into a vec file.

— Akash Kandpal

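@harrypotter0 Thx!

— Shixiang Wang

Neat and clear, mate!

— Vijay Athithya

When I tried this, I tested it on two identical datasets, and the results differed between the models. Since I start from the same initialized weights, I expected the models to end up the same. Why didn't they?

— Eric Wiener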

1

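@EricWiener Because, even when the training dataset is the same, the word vectors produced by each training run are random. The word vector spaces computed from the same dataset should still be similar, and so should their performance on NLP tasks.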
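One way to see why differing coordinates need not mean differing models: if two runs produced embedding spaces related by a rotation (an idealized assumption, made here only for illustration), every individual vector would change while all pairwise cosine similarities stayed the same. The sketch below uses synthetic vectors, not real word2vec output.

```python
import numpy as np

rng = np.random.default_rng(42)
run_a = rng.normal(size=(5, 8))      # 5 synthetic "word vectors" with 8 dimensions

# Build a random orthogonal matrix via QR decomposition and rotate the space.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
run_b = run_a @ q                    # same geometry, different coordinates

def cosine_matrix(x):
    """Pairwise cosine similarities between the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T

coords_equal = np.allclose(run_a, run_b)
sims_equal = np.allclose(cosine_matrix(run_a), cosine_matrix(run_b))
print(coords_equal, sims_equal)
```

Here the raw coordinates differ while the similarity structure matches, so comparing two trained models by nearest neighbours or downstream task performance is usually more telling than comparing their vectors element by element.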

— Shixiang Wang

4

Let's look at some sample code:

>>> from gensim.models import word2vec

>>> # train a sample model like yours
>>> sentences = [['first', 'sentence'], ['second', 'sentence']]
>>> model1 = word2vec.Word2Vec(sentences, min_count=1)

>>> # let this be the model you want to reset from
>>> sentences = [['third', 'sentence'], ['fourth', 'sentence']]
>>> model2 = word2vec.Word2Vec(sentences, min_count=1)
>>> model1.reset_from(model2)
>>> model1.wv.similarity('third', 'sentence')
-0.064622000988260417

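So, because model1 has been reset from model2, the words 'third' and 'sentence' are now in its vocabulary, and a similarity is returned for them. This is the basic usage. You can also look at reset_weights(), which resets the weights to their untrained, initial state.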
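The distinction between sharing a vocabulary and sharing trained weights can be sketched with a toy class. `ToyModel` is a hypothetical stand-in, not gensim's class. It assumes, in line with the docstring quoted in the question, that reset_from borrows the vocabulary and leaves the weights in a fresh, untrained state.

```python
import numpy as np

class ToyModel:
    """Hypothetical stand-in for a word2vec model (not gensim's class)."""

    def __init__(self, words, dim=4, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.vocab = {w: i for i, w in enumerate(sorted(set(words)))}
        self.reset_weights()

    def reset_weights(self):
        # fresh small random vectors: nothing previously learned is kept
        self.vectors = (self.rng.random((len(self.vocab), self.dim)) - 0.5) / self.dim

    def reset_from(self, other):
        # borrow the vocabulary, then start again from untrained weights
        self.vocab = other.vocab
        self.reset_weights()

model1 = ToyModel(["first", "second", "sentence"], seed=1)
model2 = ToyModel(["third", "fourth", "sentence"], seed=2)
model2.vectors[:] = 1.0   # pretend model2 has been trained

model1.reset_from(model2)
vocab_shared = "third" in model1.vocab
weights_shared = np.allclose(model1.vectors, model2.vectors)
print(vocab_shared, weights_shared)
```

Under that assumption, a lookup for 'third' succeeds after the reset, but the vectors behind it carry nothing learned by the other model, so a separate step is still needed to copy pre-trained weights.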

— Hima Varsha

2

If you are looking for a pre-trained net for word embeddings, I would suggest GloVe. The following Keras blog post is very informative about how to implement this, and it also links to the pre-trained GloVe embeddings. Pre-trained word vectors are available in sizes ranging from 50 to 300 dimensions, built on Wikipedia, Common Crawl, or Twitter data. You can download them here: http://nlp.stanford.edu/projects/glove/. Also have a look at the Keras blog for how to implement them: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
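GloVe text files store one word per line followed by its vector components, separated by spaces. Below is a minimal sketch of reading such a file into a dictionary and building an embedding matrix for a task vocabulary, roughly the preparation step the Keras post walks through. The helper names, the tiny in-memory sample standing in for a real GloVe file, and the zero vector for unknown words are illustrative assumptions.

```python
import io
import numpy as np

def load_glove(handle):
    """Parse GloVe text format: `word v1 v2 ... vN` on each line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue                      # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    # row 0 is left empty, assuming index 0 is reserved for padding
    matrix = np.zeros((len(word_index) + 1, dim), dtype=np.float32)
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = vec
    return matrix

# tiny in-memory sample in place of a downloaded GloVe file
sample = io.StringIO("the 0.1 0.2 0.3\nsentence 0.4 0.5 0.6\n")
glove = load_glove(sample)

word_index = {"the": 1, "sentence": 2, "word2vec": 3}   # hypothetical vocabulary
embedding_matrix = build_embedding_matrix(word_index, glove, dim=3)
print(embedding_matrix.shape)
```

With a real download, the same `load_glove` function can be given an open file handle (opened with UTF-8 encoding) instead of the `StringIO` sample, with `dim` set to the dimensionality of the chosen file.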

— Samuel Sherman

1

I did it here: https://gist.github.com/AbhishekAshokDubey/054af6f92d67d5ef8300fac58f59fcc9

Check whether this is what you need.

— Abhishek