Doc2Vec: Differentiate Sentence and Document


Problem Description

I am just playing around with Doc2Vec from gensim, analysing a StackExchange dump to assess the semantic similarity of questions and identify duplicates.

The tutorial (Doc2Vec-Tutorial) seems to describe the input as tagged sentences.

But the original paper (Doc2Vec-Paper) claims that the method can be used to infer fixed-length vectors for paragraphs/documents.

Can someone explain the difference between a sentence and a document in this context, and how I would go about inferring paragraph vectors?

Since a question can sometimes span multiple sentences, I thought that, during training, I would give sentences arising from the same question the same tag, but then how would I call infer_vector on unseen questions?

And this notebook (Doc2Vec-Notebook) seems to be training vectors on both TRAIN and TEST docs. Can someone explain the rationale behind this, and should I do the same?

Solution

Gensim's Doc2Vec expects you to provide text examples of the same object-shape as the example TaggedDocument class: having both a words and a tags property.

The words are an ordered sequence of string tokens of the text – they might be a single sentence's worth of text, a paragraph, or a long document; it's up to you.

The tags are a list of tags to be learned from the text – such as plain ints, or string-tokens, that somehow serve to name the corresponding texts. In the original 'Paragraph Vectors' paper, they were just unique IDs for each text – such as integers monotonically increasing from 0. (So the first TaggedDocument might have a tags of just [0], the next [1], etc.)
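For concreteness, here is a minimal sketch of building such a corpus. The raw_questions list and the bare lowercase/whitespace tokenizer are illustrative placeholders, not part of the original answer:

```python
from gensim.models.doc2vec import TaggedDocument

# Placeholder data: any iterable of raw question texts will do.
raw_questions = [
    "How do I merge two dictionaries in Python?",
    "What is the difference between a list and a tuple?",
]

# One TaggedDocument per text, tagged with a unique integer ID,
# mirroring the paper's IDs that increase monotonically from 0.
corpus = [
    TaggedDocument(words=text.lower().split(), tags=[i])
    for i, text in enumerate(raw_questions)
]
# corpus[0].words -> ['how', 'do', 'i', 'merge', ...]
# corpus[0].tags  -> [0]
```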

The algorithm just works on chunks of text, without any idea of what a sentence/paragraph/document etc might be. (Just consider them all 'documents' for the purpose of Doc2Vec, with you deciding what's the right kind of 'document' from your corpus.) It's even common for the tokenization to retain punctuation, such as the periods between sentences, as standalone tokens.

Inference occurs via the infer_vector() method, which takes a mandatory parameter doc_words, which should be a list-of-string-tokens just like those that were supplied as text words during training.

You don't supply any tags on inferred text: Doc2Vec just gives you back a raw vector that, within the relationships learned by the model, fits the text well. (That is: the vector is good at predicting the text's words, in the same way that the vectors and internal model weights learned during bulk training were good at predicting the training texts' words.)
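A hedged sketch of that workflow, reusing the corpus built above. Parameter names follow current gensim (vector_size, epochs, model.dv); releases from the era of this answer used size, iter, and model.docvecs instead, and all hyperparameter values here are illustrative:

```python
from gensim.models.doc2vec import Doc2Vec

# Build and train a small model in one step.
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Inference takes only the string tokens of the unseen text, no tags.
unseen_tokens = "how can i combine two dicts".split()
vector = model.infer_vector(unseen_tokens)

# The raw inferred vector can then be compared against the trained
# doc-vectors, e.g. to look for likely duplicate questions.
print(model.dv.most_similar([vector], topn=2))
```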

Note that many have found better results from inference by increasing the optional steps parameter (and possibly decreasing the inference starting alpha to be more like the bulk-training starting alpha of 0.025 to 0.05).
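For example (illustrative values only; epochs is the current gensim name for the parameter this answer's era called steps):

```python
# More inference passes, with a starting alpha pulled down toward the
# bulk-training default of 0.025 (the old inference default was 0.1).
vector = model.infer_vector(
    unseen_tokens,
    alpha=0.05,
    epochs=100,
)
```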

The doc2vec-IMDB demo notebook tries to reproduce one of the experiments from the original Paragraph Vectors paper, so it's following what's described there, and a demo script that one of the authors (Mikolov) once released. Since 'test' documents (without their target-labels/known-sentiments) may still be available at training time to help improve the text-modelling, it can be reasonable to include their raw texts during the unsupervised Doc2Vec training. (Their known labels are not used when training the classifier which uses the doc-vectors.)
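A sketch of that arrangement, assuming hypothetical train_texts and test_texts lists of pre-tokenized documents:

```python
# Unsupervised Doc2Vec training sees the raw text of BOTH splits...
all_docs = [
    TaggedDocument(words=tokens, tags=[i])
    for i, tokens in enumerate(train_texts + test_texts)
]
model = Doc2Vec(all_docs, vector_size=100, min_count=2, epochs=20)

# ...but only the TRAIN doc-vectors (paired with their known labels)
# go on to fit the downstream classifier; TEST labels stay unused.
X_train = [model.dv[i] for i in range(len(train_texts))]
```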

(Note that at the moment, February 2017, the doc2vec-IMDB demo notebook is a little out-of-date compared to the current gensim Doc2Vec defaults & best-practices – in particular, the models aren't given the right explicit iter=1 value to make the later manual loop-and-train() do just the right number of training passes.)
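The pattern that note alludes to looks roughly like this sketch (written with current gensim parameter names; in modern gensim a single train() call with epochs set is preferred over manual looping):

```python
# 2017-era idiom: one explicit pass per train() call, looped by hand.
model = Doc2Vec(vector_size=100, min_count=2, epochs=1)
model.build_vocab(all_docs)
for _ in range(20):  # 20 total training passes over the corpus
    model.train(all_docs,
                total_examples=model.corpus_count,
                epochs=model.epochs)
```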
