Updating training documents for gensim Doc2Vec model

Question

I have an existing gensim Doc2Vec model, and I'm trying to do iterative updates to the training set, and by extension, the model.

I take the new documents, and perform preprocessing as normal:

import gensim
import nltk

stoplist = nltk.corpus.stopwords.words('english')
train_corpus = []
for i, document in enumerate(corpus_update['body'].values.tolist()):
    train_corpus.append(gensim.models.doc2vec.TaggedDocument(
        [word for word in gensim.utils.simple_preprocess(document) if word not in stoplist],
        [i]))

I then load the original model, update the vocabulary, and retrain:

#### Original model
## model = gensim.models.doc2vec.Doc2Vec(dm=0, size=300, hs=1, min_count=10, dbow_words= 1, negative=5, workers=cores)

model = Doc2Vec.load('pvdbow_model_6_06_12_17.doc2vec')

model.build_vocab(train_corpus, update=True)

model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)

I then update the training set Pandas dataframe by appending the new data, and reset the index.

corpus = corpus.append(corpus_update)
corpus = corpus.reset_index(drop=True)

However, when I try to use infer_vector() with the updated model:

inferred_vector = model.infer_vector(tokens)
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

the result quality is poor, suggesting that the indices from the model and the training set dataframe no longer match.

When I compare it against the non-updated training set dataframe (again using the updated model) the results are fine - though, obviously I'm missing the new documents.

Is there any way to have both updated, as I want to be able to make frequent updates to the model without a full retrain of the model?

Answer

Gensim Doc2Vec doesn't yet have official support for expanding-the-vocabulary (via build_vocab(..., update=True)), so the model's behavior here is not defined to do anything useful. In fact, I think any existing doc-tags will be completely discarded and replaced with any in the latest corpus. (Additionally, there are outstanding unresolved reports of memory-fault process-crashes when trying to use update_vocab() with Doc2Vec, such as this issue.)

Even if that worked, there are a number of murky balancing issues to consider if ever continuing to call train() on a model with texts different than the initial training-set. In particular, each such training session will nudge the model to be better on the new examples, but lose value of the original training, possibly making the model worse for some cases or overall.

The most defensible policy with a growing corpus would be to occasionally retrain from scratch with all training examples combined into one corpus. Another outline of a possible process for rolling updates to a model was discussed in my recent post to the gensim discussion list.

A few other comments on your setup:

  • using both hierarchical-softmax (hs=1) and negative sampling (with negative > 0) increases the model size and training time, but may not offer any advantage compared to using just one mode with more iterations (or other tweaks) – so it's rare to have both modes active

  • by not specifying an iter, you're using the default-inherited-from-Word2Vec of '5', while published Doc2Vec work often uses 10-20 or more iterations

  • many report infer_vector working better with a much-higher value for its optional parameter steps (which has a default of only 5), and/or with smaller values of alpha (which has a default of 0.1)
