What are doc2vec training iterations?
Question
I am new to doc2vec. I was initially trying to understand doc2vec, and below is my code that uses Gensim. As expected, I get a trained model and document vectors for the two documents.
However, I would like to know the benefits of retraining the model over several epochs, and how to do that in Gensim. Can we do it using the `iter` or `alpha` parameter, or do we have to train it in a separate for loop? Please let me know how I should change the following code to train the model for 20 epochs.
Also, I am interested in knowing whether multiple training iterations are needed for the word2vec model as well.
# Import libraries
from gensim.models import doc2vec
from collections import namedtuple

# Load data
doc1 = ["This is a sentence", "This is another sentence"]

# Transform data
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))

# Train model
model = doc2vec.Doc2Vec(docs, size=100, window=300, min_count=1, workers=4)

# Get the vectors
model.docvecs[0]
model.docvecs[1]
Answer
`Word2Vec` and related algorithms (like 'Paragraph Vectors' aka `Doc2Vec`) usually make multiple training passes over the text corpus.
Gensim's `Word2Vec`/`Doc2Vec` allows the number of passes to be specified by the `iter` parameter, if you're also supplying the corpus in the object initialization to trigger immediate training. (Your code above does this by supplying `docs` to the `Doc2Vec(docs, ...)` constructor call.)
If unspecified, the default `iter` value used by gensim is 5, to match the default used by Google's original word2vec.c release. So your code above is already using 5 training passes.
Published `Doc2Vec` work often uses 10-20 passes. If you wanted to do 20 passes instead, you could change your `Doc2Vec` initialization to:
model = doc2vec.Doc2Vec(docs, iter=20, ...)
Because `Doc2Vec` often uses a unique identifier tag for each document, more iterations can be more important: they ensure every doc-vector comes up for training multiple times over the course of training, as the model gradually improves. On the other hand, because the words in a `Word2Vec` corpus might appear anywhere throughout the corpus, each word's associated vectors get multiple adjustments, early, middle, and late in the process, as the model improves, even with just a single pass. (So with a giant, varied `Word2Vec` corpus, using fewer than the default number of passes is conceivable.)
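The asymmetry above can be illustrated with a toy count of how often each doc-tag versus each word comes up for an update per pass. This is a simplified sketch only; real training also samples context windows and negative examples:

```python
from collections import Counter

# Two tiny documents, each tagged with its index (as in the question's code).
corpus = [(["this", "is", "a", "sentence"], 0),
          (["this", "is", "another", "sentence"], 1)]

def update_counts(corpus, passes):
    """Roughly count training touches: a doc's tag vector is touched at
    every position in its own document; a word vector is touched wherever
    that word occurs anywhere in the corpus."""
    tag_touches, word_touches = Counter(), Counter()
    for _ in range(passes):
        for words, tag in corpus:
            tag_touches[tag] += len(words)
            word_touches.update(words)
    return tag_touches, word_touches

tags, words = update_counts(corpus, passes=1)
# Even with a single pass, "this" is touched in both documents, while each
# tag is only ever touched while its own document is being trained; extra
# passes are the only way a tag vector gets revisited later in training.
```
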
You don't need to do your own loop, and most users shouldn't. If you do manage the separate `build_vocab()` and `train()` steps yourself, instead of the easier step of supplying the `docs` corpus in the initializer call to trigger immediate training, then you must supply an `epochs` argument to `train()`, and it will perform that number of passes, so you still only need one call to `train()`.