Term weighting for original LDA in gensim

Views: 113 Published: 2020/4/30 8:39:58 Tags: python lda topic-modeling gensim

Question

I am using the gensim library to apply LDA to a set of documents. With gensim, I can apply LDA to a corpus regardless of the term weighting: binary, tf, tf-idf, and so on.

My question is: which term weighting should be used for the original LDA? If I have understood correctly, the weights should be term frequencies, but I am not sure.

Recommended answer

It should be a corpus represented as a "bag of words". Or, yes, lists of term counts.

The correct format is that of the corpus defined in the first tutorial on the Gensim webpage (the tutorials are really useful).

Namely, if you have a dictionary as defined in Radim's tutorial, and the following documents,

doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
docs = [doc1, doc2]

then your corpus (for use with LDA) should be an iterable (such as a list) of lists of tuples of the form (dictKey, count), where dictKey is the dictionary key of a term and count is the number of times that term occurs in the document. This is done for you with

corpus = [dictionary.doc2bow(doc) for doc in docs]

The name doc2bow stands for "document to bag of words".
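
To make the (dictKey, count) format concrete, here is a minimal pure-Python sketch of the kind of output a bag-of-words conversion produces. It is not gensim's implementation: the helpers build_token2id and to_bow are hypothetical names introduced only for illustration, the third document is an added example with a repeated term, and the integer ids assigned here may differ from the ones a gensim dictionary would assign.

```python
from collections import Counter

# The two documents from the answer, plus a hypothetical third one
# containing a repeated term so that a count greater than 1 shows up.
doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
doc3 = ['big', 'data', 'big', 'cash']
docs = [doc1, doc2, doc3]


def build_token2id(documents):
    """Assign an integer id to each distinct token, in order of first appearance.

    Illustrative stand-in for a dictionary; gensim may assign different ids.
    """
    token2id = {}
    for document in documents:
        for token in document:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(document, token2id):
    """Return a sorted list of (token_id, count) tuples for one document.

    Tokens missing from token2id are skipped. The counts are raw term
    frequencies (integers), not binary flags or tf-idf weights.
    """
    counts = Counter(token2id[token] for token in document if token in token2id)
    return sorted(counts.items())


token2id = build_token2id(docs)
corpus = [to_bow(document, token2id) for document in docs]

for bow in corpus:
    print(bow)
```

Each inner list holds one tuple per distinct term in the document, and the second element of each tuple is an integer count, which is the term-frequency representation the answer describes.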
