How to get a complete topic distribution for a document using gensim LDA?


Problem description

When I train my LDA model as such:

import multiprocessing

from gensim import corpora
from gensim.models import LdaMulticore

# `data` is a list of tokenized documents (each a list of token strings)
dictionary = corpora.Dictionary(data)
corpus = [dictionary.doc2bow(doc) for doc in data]
num_cores = multiprocessing.cpu_count()
num_topics = 50
lda = LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary,
                   workers=num_cores, alpha=1e-5, eta=5e-1)

I want to get a full topic distribution over all num_topics for each and every document. That is, in this particular case, I want each document to have 50 topics contributing to its distribution, and I want to be able to access all 50 topics' contributions. This output is what LDA should produce if adhering strictly to the mathematics of LDA. However, gensim only outputs topics that exceed a certain threshold, as shown here. For example, if I try

lda[corpus[89]]
>>> [(2, 0.38951721864890398), (9, 0.15438596408262636), (37, 0.45607443684895665)]

which shows only the 3 topics that contribute most to document 89. I have tried the solution in the link above; however, it does not work for me. I still get the same output:

theta, _ = lda.inference(corpus)     # variational parameters (gamma), one row per document
theta /= theta.sum(axis=1)[:, None]  # normalize each row into a probability distribution

This produces the same output, i.e. only 2 or 3 topics per document.

My question is: how do I change this threshold so I can access the FULL topic distribution for each document? How can I access the full topic distribution, no matter how insignificant a topic's contribution to a document is? The reason I want the full distribution is so that I can perform a KL similarity search between documents' distributions.
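(For reference, the KL comparison the question is aiming for could look like the minimal sketch below. The helper to_dense and the second document index are hypothetical; scipy.stats.entropy(p, q) computes the KL divergence D(p || q). Any topic that gensim's truncated output drops becomes a zero in the dense vector, and a zero in q where p is nonzero makes the divergence infinite, which is exactly why the full distribution is needed.)

import numpy as np
from scipy.stats import entropy

def to_dense(doc_topics, num_topics):
    """Expand gensim's sparse [(topic_id, prob), ...] list into a dense vector."""
    vec = np.zeros(num_topics)
    for topic_id, prob in doc_topics:
        vec[topic_id] = prob
    return vec

# Hypothetical pair of documents to compare (89 is the document from above)
p = to_dense(lda[corpus[89]], num_topics)
q = to_dense(lda[corpus[90]], num_topics)

# KL divergence D(p || q): infinite wherever q is 0 but p is not,
# hence the need for the full, untruncated distribution
print(entropy(p, q))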

Thanks in advance.

Answer

It doesn't seem that anyone has replied yet, so I'll try to answer this as best I can given the gensim documentation.

It seems you need to set the parameter minimum_probability to 0.0 when training the model to get the desired results:

lda = LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary,
                   workers=num_cores, alpha=1e-5, eta=5e-1,
                   minimum_probability=0.0)

lda[corpus[233]]
>>> [(0, 5.8821799358842424e-07),
 (1, 5.8821799358842424e-07),
 (2, 5.8821799358842424e-07),
 (3, 5.8821799358842424e-07),
 (4, 5.8821799358842424e-07),
 (5, 5.8821799358842424e-07),
 (6, 5.8821799358842424e-07),
 (7, 5.8821799358842424e-07),
 (8, 5.8821799358842424e-07),
 (9, 5.8821799358842424e-07),
 (10, 5.8821799358842424e-07),
 (11, 5.8821799358842424e-07),
 (12, 5.8821799358842424e-07),
 (13, 5.8821799358842424e-07),
 (14, 5.8821799358842424e-07),
 (15, 5.8821799358842424e-07),
 (16, 5.8821799358842424e-07),
 (17, 5.8821799358842424e-07),
 (18, 5.8821799358842424e-07),
 (19, 5.8821799358842424e-07),
 (20, 5.8821799358842424e-07),
 (21, 5.8821799358842424e-07),
 (22, 5.8821799358842424e-07),
 (23, 5.8821799358842424e-07),
 (24, 5.8821799358842424e-07),
 (25, 5.8821799358842424e-07),
 (26, 5.8821799358842424e-07),
 (27, 0.99997117731831464),
 (28, 5.8821799358842424e-07),
 (29, 5.8821799358842424e-07),
 (30, 5.8821799358842424e-07),
 (31, 5.8821799358842424e-07),
 (32, 5.8821799358842424e-07),
 (33, 5.8821799358842424e-07),
 (34, 5.8821799358842424e-07),
 (35, 5.8821799358842424e-07),
 (36, 5.8821799358842424e-07),
 (37, 5.8821799358842424e-07),
 (38, 5.8821799358842424e-07),
 (39, 5.8821799358842424e-07),
 (40, 5.8821799358842424e-07),
 (41, 5.8821799358842424e-07),
 (42, 5.8821799358842424e-07),
 (43, 5.8821799358842424e-07),
 (44, 5.8821799358842424e-07),
 (45, 5.8821799358842424e-07),
 (46, 5.8821799358842424e-07),
 (47, 5.8821799358842424e-07),
 (48, 5.8821799358842424e-07),
 (49, 5.8821799358842424e-07)]
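As a side note, and an assumption worth checking against the gensim documentation for your version: the same threshold can also be overridden per query through get_document_topics, without retraining the model:

# Per-query override of the threshold (assumes get_document_topics is
# available on the trained model, as in recent gensim versions)
full_dist = lda.get_document_topics(corpus[233], minimum_probability=0.0)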

