Is it possible to re-train a word2vec model (e.g. GoogleNews-vectors-negative300.bin) from a corpus of sentences in Python?


Question

I am using the pre-trained Google News vectors to get word vectors with the Gensim library in Python:

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

After loading the model, I convert the words of my training review sentences into vectors:

# Read all sentences from the training file
with open('restaurantSentences', 'r') as infile:
    x_train = infile.readlines()
# Clean the sentences
x_train = [review_to_wordlist(review, remove_stopwords=True) for review in x_train]
train_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])

During the word2vec lookup I get many errors for words in my corpus that are not in the model. How can I retrain an already pre-trained model (e.g. GoogleNews-vectors-negative300.bin) to get word vectors for those missing words?
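Independently of retraining, the lookup errors themselves can be avoided by skipping out-of-vocabulary words when building a sentence vector. The following is a minimal sketch under stated assumptions: `average_vector` is a hypothetical helper (not the `buildWordVector` from the question, whose body is not shown), and a plain dict stands in for the loaded model so the example runs without Gensim.

```python
def average_vector(words, vectors, n_dim):
    """Average the vectors of the words found in `vectors`, skipping the rest."""
    total = [0.0] * n_dim
    known = 0
    for word in words:
        vec = vectors.get(word)  # None for out-of-vocabulary words
        if vec is None:
            continue
        total = [t + v for t, v in zip(total, vec)]
        known += 1
    if known == 0:
        return total  # all-zero vector when no word is known
    return [t / known for t in total]


# Toy 2-dimensional stand-in for a word-vector model (assumption for illustration).
toy_vectors = {"food": [1.0, 0.0], "rice": [0.0, 1.0]}
print(average_vector(["food", "rice", "biryaniii"], toy_vectors, 2))  # [0.5, 0.5]
```

Skipping unknown words does not give them vectors; it only keeps the pipeline from failing while the retraining question is open.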

Here is what I have tried so far: training a new model from the training sentences I have.

# Set values for various parameters
num_features = 300    # Word vector dimensionality
min_word_count = 10   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

sentences = gensim.models.word2vec.LineSentence("restaurantSentences")
# Initialize and train the model (this will take some time)
print("Training model...")
model = gensim.models.Word2Vec(sentences, workers=num_workers, size=num_features,
                               min_count=min_word_count, window=context,
                               sample=downsampling)


model.build_vocab(sentences)
model.train(sentences)
model.n_similarity(["food"], ["rice"])

It worked! The problem is that I have a really small dataset and limited resources, so I cannot train a large model this way.

The second approach I am considering is to extend an already trained model such as GoogleNews-vectors-negative300.bin:

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
sentences = gensim.models.word2vec.LineSentence("restaurantSentences")
model.train(sentences)

Is this possible, and is it a good approach? Please help me out.

Recommended answer

This is how I solved the issue technically:

Prepare the data input with the sentence iterable from Radim Rehurek's tutorial: https://rare-technologies.com/word2vec-tutorial/

sentences = MySentences('newcorpus')  

Set up the model:

model = gensim.models.Word2Vec(sentences)

Intersect the vocabulary with the Google word vectors:

model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin',
                                lockf=1.0,
                                binary=True)

Finally, train the model to update the vectors:

model.train(sentences)
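The intersect step above can be pictured with a toy model. The sketch below is not Gensim's implementation; `intersect` is a hypothetical function over plain dicts that mirrors the documented behaviour of `intersect_word2vec_format`: words present in both vocabularies take the pre-trained weights, no new words are added, and `lockf=1.0` leaves the imported vectors free to change in later training (a value of 0.0 would freeze them).

```python
def intersect(local_vectors, pretrained_vectors, lockf=1.0):
    """Copy pre-trained vectors onto words already in the local vocabulary.

    Returns a per-word lock factor: 1.0 means the vector may still be
    updated by training, 0.0 means it is frozen.
    """
    lock_factors = {word: 1.0 for word in local_vectors}
    for word in local_vectors:
        if word in pretrained_vectors:
            local_vectors[word] = list(pretrained_vectors[word])
            lock_factors[word] = lockf
    return lock_factors


# Toy vocabularies (assumptions for illustration only).
local = {"food": [0.1, 0.2], "biryani": [0.3, 0.4]}
pretrained = {"food": [0.9, 0.8], "news": [0.5, 0.5]}

locks = intersect(local, pretrained, lockf=1.0)
print(local)   # "food" now carries the pre-trained vector; "news" was not added
print(locks)
```

A consequence worth noting: words that appear only in the new corpus, such as "biryani" here, start from their own initialisation and are learned from the small corpus alone.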

A note of warning: from a substantive point of view, it is highly debatable whether a corpus that is likely to be very small can actually "improve" the Google word vectors, which were trained on a massive corpus.
