How to find most similar terms/words of a document in doc2vec?


Question

I have applied Doc2vec to convert documents into vectors. After that, I used the vectors for clustering and found the 5 nearest/most similar documents to the centroid of each cluster. Now I need to find the most dominant or important terms of these documents so that I can work out the characteristics of each cluster. My question: is there any way to find the most dominant or similar terms/words of a document in Doc2vec? I am using Python's gensim package for the Doc2vec implementation.

Recommended answer

To find the most dominant words of your clusters, you can use either of these two classic approaches. I personally found the second one very efficient and effective for this purpose.


  • Latent Dirichlet Allocation (LDA): A topic modelling algorithm that gives you a set of topics given a collection of documents. You can treat the set of similar documents in each cluster as one document, apply LDA to generate the topics, and look at the topic distributions across documents.

  • TF-IDF: TF-IDF calculates the importance of a word to a document given a collection of documents. Therefore, to find the most important keywords/n-grams, you can calculate TF-IDF for every word that appears in the documents. The words with the highest TF-IDF are then your keywords. So:


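    • Calculate the IDF of every word that appears in the documents, based on the number of documents that contain that word.

    • Concatenate the text of the similar documents (I'd call the result a super-document), then calculate the TF of every word that appears in this super-document.

    • Calculate TF * IDF for every word... and then TA DAAA... you have your keywords associated with each cluster.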
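
The three steps above can be sketched in plain Python. This is only a minimal illustration, not the answerer's own code: the function names `tokenize` and `cluster_keywords`, the crude regex tokenizer, the unsmoothed `log(N / df)` form of IDF, and the toy documents are all assumptions made for the example. In practice you would plug in your own tokenizer and the documents nearest to each centroid.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of ASCII letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def cluster_keywords(all_docs, cluster_docs, top_n=5):
    """Rank the words of one cluster by TF * IDF.

    all_docs: every document in the collection (list of strings).
    cluster_docs: the documents closest to one cluster centroid.
    """
    # Step 1: IDF from the number of documents that contain each word.
    n_docs = len(all_docs)
    doc_freq = Counter()
    for doc in all_docs:
        doc_freq.update(set(tokenize(doc)))
    idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

    # Step 2: TF counted over the concatenated "super-document".
    tf = Counter(tokenize(" ".join(cluster_docs)))

    # Step 3: TF * IDF, highest first. Words never seen in all_docs score 0.
    scores = {word: count * idf.get(word, 0.0) for word, count in tf.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Toy collection: the last two documents stand in for one cluster.
docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog barked at the cat",
    "the stock market fell as stock prices dropped",
    "the stock market rallied",
]
keywords = cluster_keywords(docs, docs[3:], top_n=2)
print([word for word, _ in keywords])  # ['stock', 'market']
```

A word such as "the" that occurs in every document gets an IDF of zero here, so it drops out of the ranking without needing a stop-word list.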

Take a look at Section 5.1 here for more details on the use of TF-IDF.
