How to use Gensim doc2vec with pre-trained word vectors?

Question

I recently came across the doc2vec addition to Gensim. How can I use pre-trained word vectors (e.g. those found on the original word2vec website) with doc2vec?

Or is doc2vec getting the word vectors from the same sentences it uses for paragraph-vector training?

Thanks.

Answer

Note that the "DBOW" (dm=0) training mode doesn't require or even create word-vectors as part of the training. It merely learns document vectors that are good at predicting each word in turn (much like the word2vec skip-gram training mode).
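For context, here is a minimal DBOW training sketch. Parameter and attribute names follow recent gensim (4.x: vector_size, epochs, model.dv); older releases used size, iter, and model.docvecs. The toy corpus and parameter values are illustrative only:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus: each document is a list of tokens plus a unique tag.
    corpus = [
        TaggedDocument(words=["human", "machine", "interface"], tags=["doc0"]),
        TaggedDocument(words=["a", "graph", "of", "random", "trees"], tags=["doc1"]),
    ]

    # dm=0 selects pure DBOW: only doc-vectors are trained; the model's
    # word-vectors stay at their random initialization and go unused.
    model = Doc2Vec(corpus, dm=0, vector_size=100, window=5,
                    min_count=1, epochs=20)

    print(model.dv["doc0"])  # the learned 100-dimensional document vector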

(Before gensim 0.12.0, there was the parameter train_words mentioned in another comment, which some documentation suggested would co-train words. However, I don't believe this ever actually worked. Starting in gensim 0.12.0, there is the parameter dbow_words, which skip-gram trains words simultaneously with DBOW doc-vectors. Note that this makes training take longer, by a factor related to window. So if you don't need word-vectors, you may still leave this off.)
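If you do want word-vectors alongside the DBOW doc-vectors, the dbow_words flag enables that. A sketch reusing the corpus and imports from the previous example:

    # dbow_words=1 interleaves skip-gram word-vector training with the DBOW
    # doc-vector training, making each pass slower by a window-related factor.
    model = Doc2Vec(corpus, dm=0, dbow_words=1, vector_size=100,
                    window=5, min_count=1, epochs=20)

    print(model.wv["graph"])  # now a genuinely trained word vector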

在"DM"训练方法(dm=1)中,单词向量与doc向量一起在过程中被固有地训练,并且很可能也会影响doc向量的质量.从理论上讲,有可能预先初始化先前数据中的单词向量.但是我不知道有任何强大的理论或实验理由可以确信这会改善文档向量.

In the "DM" training method (dm=1), word-vectors are inherently trained during the process along with doc-vectors, and are likely to also affect the quality of the doc-vectors. It's theoretically possible to pre-initialize the word-vectors from prior data. But I don't know any strong theoretical or experimental reason to be confident this would improve the doc-vectors.

One fragmentary experiment I ran along these lines suggested the doc-vector training got off to a faster start – better predictive qualities after the first few passes – but this advantage faded with more passes. Whether you hold the word vectors constant or let them continue to adjust with the new training is also likely an important consideration... but which choice is better may depend on your goals, data set, and the quality/relevance of the pre-existing word-vectors.

(You could repeat my experiment with the intersect_word2vec_format() method available in gensim 0.12.0, and try different levels of making pre-loaded vectors resistant-to-new-training via the syn0_lockf values. But remember this is experimental territory: the basic doc2vec results don't rely on, or even necessarily improve with, reused word vectors.)
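Here is a sketch of that kind of experiment, reusing the corpus above. It assumes a gensim release where intersect_word2vec_format() is exposed on the Doc2Vec model (its exact location has shifted across versions, so check your release); its lockf argument plays the syn0_lockf role: 0.0 freezes the pre-loaded vectors against further training, while 1.0 lets them keep adjusting. The file name is a placeholder for whatever word2vec-format vectors you have:

    # vector_size must match the dimensionality of the pre-trained file.
    model = Doc2Vec(dm=1, vector_size=300, window=5, min_count=1, epochs=20)
    model.build_vocab(corpus)

    # Overwrite the randomly initialized vectors of any vocabulary word that
    # also appears in the pre-trained file; words absent from the file keep
    # their random starting point. lockf=0.0 freezes the imported vectors.
    model.intersect_word2vec_format("GoogleNews-vectors-negative300.bin",
                                    binary=True, lockf=0.0)

    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)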
