How to use build_vocab in gensim?

Views: 1985 | Published: 2020/11/13 6:19:29 | Tags: nlp, word2vec, gensim, doc2vec


Question

Does build_vocab() extend my old vocabulary?

My understanding is that when I train a model with Doc2Vec, it only builds the vocabulary from the dataset it is given. If I want to extend that vocabulary, I need to use build_vocab().

Where should I use it? Should I put it after "gensim.doc2vec()"?

For example:

sentences = gensim.models.doc2vec.TaggedLineDocument(f_path)
dm_model = gensim.models.doc2vec.Doc2Vec(sentences, dm=1, size=300, window=8, min_count=5, workers=4)
dm_model.build_vocab()

Recommended answer

You should follow working examples in the gensim documentation, tutorials, and notebooks, or in online tutorials, to understand which steps are necessary and in what order.

In particular, if you pass your sentences corpus iterable to Doc2Vec() at initialization, it automatically performs both the vocabulary-discovery pass and all training, so you don't then need to call either build_vocab() or train() yourself. Furthermore, you would never call build_vocab() with no arguments. (No working example in the docs or online does what your code does, so don't improvise new things until you've followed the examples and understand why they do what they do.)

build_vocab() has an optional update argument, which purports to allow a vocabulary from an earlier training session to be expanded (in preparation for further training with the newer words). However, it has only been developed and tested with Word2Vec models; there are reports that it causes crashes when used with Doc2Vec. Even in Word2Vec, its overall effects and the best ways to use it aren't clear across all training modes. So I don't recommend its use except for experts who can read and interpret the source code, and the many tradeoffs involved, on their own. If you receive a batch of new texts containing new words, the best-grounded course of action, and the easiest to evaluate and reason about, is to retrain from scratch using a combined corpus of all text examples.
