Passing Term-Document Matrix to Gensim LDA Model

Question

My term-document matrix is in numpy matrix format, and I have a dictionary that records which term each column of the matrix represents.

Is there any way I can easily pass these two into Gensim's LDA model?

import pickle
import numpy as np
tdMatrix = np.load('tdmatrix.npy')
dictionary = pickle.load(open('dictionary.p', 'rb'))  # stores the term represented by each column; pickle files must be opened in binary mode

Can I pass this somehow to gensim.models.ldamodel.LdaModel?

Recommended answer

I believe Gensim uses much the same structure to represent a bag-of-words corpus, but I don't think a plain Python dictionary or a numpy array is accepted directly. Gensim's API lists a few corpus readers that accommodate various formats, but those seem to be built for importing data from other toolkits. So in your case the easiest solution may be to reconstruct the documents from your matrix and dictionary as lists of separate strings, convert those lists to Gensim's bag-of-words corpus, and then train LDA on that corpus as shown in the tutorials.
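
The reconstruction described above could look roughly like the sketch below. It assumes that rows of the matrix are documents, that columns are terms, and that the dictionary maps a column index to its term string; none of this is confirmed by the question, so adjust if your data is laid out differently. The helper names matrix_to_documents and train_lda are hypothetical, and num_topics=2 is only a placeholder. If the matrix is an np.matrix rather than a regular array, converting it with np.asarray first is probably needed so that each row iterates element by element.

```python
def matrix_to_documents(td_matrix, id2term):
    """Rebuild each document as a list of tokens, repeating a term once per count.

    Assumption: rows are documents, columns are terms, and id2term maps a
    column index to its term string.
    """
    documents = []
    for row in td_matrix:
        tokens = []
        for col, count in enumerate(row):
            # A count of 0 adds nothing; a count of 3 adds the term three times.
            tokens.extend([id2term[col]] * int(count))
        documents.append(tokens)
    return documents


def train_lda(documents, num_topics=2):
    """Convert token lists to Gensim's bag-of-words corpus and fit LDA on it."""
    # Imported here so matrix_to_documents can be used without Gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    gensim_dict = Dictionary(documents)                       # token -> id mapping
    corpus = [gensim_dict.doc2bow(doc) for doc in documents]  # lists of (id, count)
    return LdaModel(corpus=corpus, id2word=gensim_dict, num_topics=num_topics)
```

With the variables from the question, train_lda(matrix_to_documents(tdMatrix, dictionary)) would then return a trained model. Note that Gensim assigns its own token ids here, so they will not necessarily match the original column indices.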

This approach has the added benefit that you can apply Gensim's preprocessing functions and filter words with low/high frequencies.
