What are doc2vec training iterations?


Question

I am new to doc2vec. I was initially trying to understand doc2vec, and below is my code, which uses Gensim. As I wanted, I get a trained model and document vectors for the two documents.

However, I would like to know the benefits of retraining the model over several epochs and how to do it in Gensim. Can we do it using the iter or alpha parameter, or do we have to train it in a separate for loop? Please let me know how I should change the following code to train the model for 20 epochs.

Also, I am interested in knowing whether multiple training iterations are needed for the word2vec model as well.

# Import libraries
from gensim.models import doc2vec
from collections import namedtuple

# Load data
doc1 = ["This is a sentence", "This is another sentence"]

# Transform data
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))

# Train model
model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4)

# Get the vectors
model.docvecs[0]
model.docvecs[1]


Answer

Word2Vec and related algorithms (like 'Paragraph Vectors' aka Doc2Vec) usually make multiple training passes over the text corpus.

Gensim's Word2Vec/Doc2Vec allows the number of passes to be specified by the iter parameter, if you're also supplying the corpus in the object initialization to trigger immediate training. (Your code above does this by supplying docs to the Doc2Vec(docs, ...) constructor call.)

If unspecified, the default iter value used by gensim is 5, to match the default used by Google's original word2vec.c release. So your code above is already using 5 training passes.

Published Doc2Vec work often uses 10-20 passes. If you wanted to do 20 passes instead, you could change your Doc2Vec initialization to:

model = doc2vec.Doc2Vec(docs, iter=20, ...)

Because Doc2Vec often uses unique identifier tags for each document, more iterations can be more important, so that every doc-vector comes up for training multiple times over the course of training, as the model gradually improves. On the other hand, because the words in a Word2Vec corpus might appear anywhere throughout it, each word's associated vectors will get multiple adjustments, early, middle, and late in the process, as the model improves, even with just a single pass. (So with a giant, varied Word2Vec corpus, it's thinkable to use fewer than the default number of passes.)

You don't need to do your own loop, and most users shouldn't. If you do manage the separate build_vocab() and train() steps yourself, instead of the easier step of supplying the docs corpus in the initializer call to trigger immediate training, then you must supply an epochs argument to train() – and it will perform that number of passes, so you still only need one call to train().
