How to use Gensim doc2vec with pre-trained word vectors?

Question

I recently came across the doc2vec addition to Gensim. How can I use pre-trained word vectors (e.g., those found on the original word2vec website) with doc2vec?

Or is doc2vec getting the word vectors from the same sentences it uses for paragraph-vector training?

Thanks.

Answer

Note that the "DBOW" (dm=0) training mode doesn't require or even create word-vectors as part of the training. It merely learns document vectors that are good at predicting each word in turn (much like the word2vec skip-gram training mode).
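
For concreteness, here is a minimal pure-DBOW sketch. The toy corpus and tag names are mine, and it assumes gensim 4.x naming (model.dv for doc-vectors; older gensims use model.docvecs):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document is a token list plus a unique tag.
corpus = [
    TaggedDocument(words=["machine", "learning", "is", "fun"], tags=["doc0"]),
    TaggedDocument(words=["gensim", "implements", "paragraph", "vectors"], tags=["doc1"]),
]

# dm=0 selects pure DBOW: only doc-vectors are trained. The word-vector
# array is still allocated, but never updated, so model.wv stays at its
# random initialization.
model = Doc2Vec(corpus, dm=0, vector_size=50, min_count=1, epochs=40)

print(model.dv["doc0"][:5])    # a trained doc-vector
print(model.wv["gensim"][:5])  # random/untrained in pure DBOW
```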

(Before gensim 0.12.0, there was the parameter train_words mentioned in another comment, which some documentation suggested would co-train words. However, I don't believe this ever actually worked. Starting in gensim 0.12.0, there is the parameter dbow_words, which skip-gram trains words simultaneously with DBOW doc-vectors. Note that this makes training take longer – by a factor related to window. So if you don't need word-vectors, you may still leave this off.)
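
A sketch of the same setup with dbow_words turned on (same assumptions and toy corpus as above):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(["machine", "learning", "is", "fun"], ["doc0"]),
    TaggedDocument(["gensim", "implements", "paragraph", "vectors"], ["doc1"]),
]

# dbow_words=1 interleaves skip-gram word training with DBOW doc-vector
# training, so model.wv becomes meaningful -- at extra cost roughly
# proportional to `window`.
model = Doc2Vec(corpus, dm=0, dbow_words=1, vector_size=50, window=5,
                min_count=1, epochs=40)
```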

在DM"训练方法(dm=1)中,词向量在过程中与文档向量一起被固有地训练,并且很可能也会影响文档的质量——向量.理论上可以从先前的数据中预初始化词向量.但我不知道有任何强有力的理论或实验理由来相信这会改善文档向量.

In the "DM" training method (dm=1), word-vectors are inherently trained during the process along with doc-vectors, and are likely to also affect the quality of the doc-vectors. It's theoretically possible to pre-initialize the word-vectors from prior data. But I don't know any strong theoretical or experimental reason to be confident this would improve the doc-vectors.

One fragmentary experiment I ran along these lines suggested the doc-vector training got off to a faster start – better predictive qualities after the first few passes – but this advantage faded with more passes. Whether you hold the word vectors constant or let them continue to adjust with the new training is also likely an important consideration... but which choice is better may depend on your goals, data set, and the quality/relevance of the pre-existing word-vectors.

(You could repeat my experiment with the intersect_word2vec_format() method available in gensim 0.12.0, and try different levels of making pre-loaded vectors resistant-to-new-training via the syn0_lockf values. But remember this is experimental territory: the basic doc2vec results don't rely on, or even necessarily improve with, reused word vectors.)
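
As a hedged sketch of that experiment: the code below assumes the older API described above, where intersect_word2vec_format() is a method on the model itself and lock-factors control how much imported vectors keep adjusting. In gensim 4.x the equivalents moved under model.wv (with vectors_lockf replacing syn0_lockf), and exact signatures have shifted between releases. The vectors file path is a placeholder.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(["some", "training", "sentences"], ["doc0"]),
    TaggedDocument(["more", "training", "sentences"], ["doc1"]),
]

# vector_size must match the dimensionality of the pre-trained file
# (e.g. 300 for the GoogleNews vectors).
model = Doc2Vec(dm=1, vector_size=300, window=5, min_count=1, epochs=20)
model.build_vocab(corpus)

# Merge pre-trained vectors: vocabulary words found in the file have their
# vectors overwritten; the rest keep their random initialization. lockf=0.0
# freezes the imported vectors against further training; lockf=1.0 lets them
# keep adjusting. Per-word control is possible by editing the lock array
# (syn0_lockf here; vectors_lockf in later gensims) directly.
model.intersect_word2vec_format("pretrained-word-vectors.bin", binary=True,
                                lockf=0.0)

model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```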
