Document topical distribution in Gensim LDA


Question

I've derived an LDA topic model using a toy corpus as follows:

from gensim import corpora
from gensim.models import LdaModel

documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

# Lowercase each document and split it into tokens on whitespace
texts = [document.lower().split() for document in documents]
dictionary = corpora.Dictionary(texts)

# Bag-of-words corpus that the LdaModel calls below train on
corpus = [dictionary.doc2bow(text) for text in texts]

# Reverse mapping from token id to word
id2word = {token_id: word for word, token_id in dictionary.token2id.items()}

I found that when I train the model with a small number of topics, Gensim reports the full topic distribution of a test document, i.e. a probability for every topic. E.g.:

test_lda = LdaModel(corpus, num_topics=5, id2word=id2word)
# doc2bow expects a list of tokens, not a raw string
test_lda[dictionary.doc2bow('human system'.split())]

Out[314]: [(0, 0.59751626959781134),
(1, 0.10001902477790173),
(2, 0.10001375856907335),
(3, 0.10005453508763221),
(4, 0.10239641196758137)]

However, when I use a large number of topics, the report is no longer complete:

test_lda = LdaModel(corpus, num_topics=100, id2word=id2word)

# doc2bow expects a list of tokens, not a raw string
test_lda[dictionary.doc2bow('human system'.split())]
Out[315]: [(73, 0.50499999999997613)]

It seems to me that topics with a probability below some threshold (0.01, as far as I can observe) are omitted from the output.

Is this behaviour purely cosmetic? And how can I get the distribution of the remaining probability mass over all the other topics?

Thank you for your kind answers!

Recommended answer

Reading the source shows that topics with a probability smaller than a threshold are left out of the output. This threshold defaults to 0.01.
