pyLDAvis: Validation error on trying to visualize topics


Problem description

I generated topics for 300,000 records using gensim. I can print the topics after training the model, but when I try to visualize them with pyLDAvis I get a validation error.

import gensim
import pyLDAvis.gensim

# Assumption: Lda is gensim's multicore LDA class (the workers argument suggests it).
Lda = gensim.models.LdaMulticore

# Train the LDA model on the document-term matrix.
ldamodel1 = Lda(doc_term_matrix1, num_topics=10, id2word=dictionary1, passes=50, workers=4)

print(ldamodel1.print_topics(num_topics=10, num_words=10))

# pyLDAvis: load the saved dictionary, corpus, and model
d = gensim.corpora.Dictionary.load('dictionary1.dict')
c = gensim.corpora.MmCorpus('corpus.mm')
lda = gensim.models.LdaModel.load('topic.model')

# Error on executing this line
data = pyLDAvis.gensim.prepare(lda, c, d)

Running the pyLDAvis code above produces the following error:

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
<ipython-input-53-33fd88b65056> in <module>()
----> 1 data = pyLDAvis.gensim.prepare(lda, c, d)
      2 data

C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\gensim.py in prepare(topic_model, corpus, dictionary, doc_topic_dist, **kwargs)
    110     """
    111     opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
--> 112     return vis_prepare(**opts)

C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py in prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency, R, lambda_step, mds, n_jobs, plot_opts, sort_topics)
    372    doc_lengths      = _series_with_name(doc_lengths, 'doc_length')
    373    vocab            = _series_with_name(vocab, 'vocab')
--> 374    _input_validate(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
    375    R = min(R, len(vocab))
    376 

C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py in _input_validate(*args)
     63    res = _input_check(*args)
     64    if res:
---> 65       raise ValidationError('\n' + '\n'.join([' * ' + s for s in res]))
     66 
     67 

ValidationError: 
 * Not all rows (distributions) in topic_term_dists sum to 1.

Recommended answer

This happens because pyLDAvis expects all document topics in the model to show up in the corpus at least once. It can occur when you do some preprocessing after building your corpus/text and before training your model.

A word that is in the model's internal dictionary but not in the dictionary you provide will cause this to fail, because the probabilities in each topic then sum to slightly less than one.
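The effect can be reproduced with a toy example. This is only a sketch with made-up numbers, not pyLDAvis's own code: `topic_term`, `provided_ids`, and `missing_ids` are hypothetical names standing in for the model's topic-term matrix and the term ids of the dictionary passed to `prepare`.

```python
import numpy as np

# Toy topic-term matrix: 2 topics over a 4-term model vocabulary (made-up numbers).
# Each row is a probability distribution and sums to 1.
topic_term = np.array([
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
])

model_ids = set(range(topic_term.shape[1]))  # term ids the model was trained on
provided_ids = [0, 1, 2]                     # ids in the dictionary you pass in (term 3 is absent)

# Term ids the model knows about but the provided dictionary lacks.
missing_ids = sorted(model_ids - set(provided_ids))

# Keeping only the provided columns drops the probability mass of the missing terms,
# so every row now sums to less than 1.
subset = topic_term[:, provided_ids]
row_sums = subset.sum(axis=1)

print(missing_ids)
print(row_sums)
```

With gensim objects, the two id sets would presumably come from the model's `id2word` mapping and the dictionary's `token2id` mapping; comparing them is a quick way to see which words are missing.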

You can fix this in one of two ways. Either add the missing words to your corpus dictionary (or add the words to the corpus and build the dictionary from that), or add the following line to site-packages\pyLDAvis\gensim.py just before "assert topic_term_dists.shape[0] == doc_topic_dists.shape[1]" (should be around line 67):

topic_term_dists = topic_term_dists / topic_term_dists.sum(axis=1)[:, None]

Assuming your code ran up to that point, this renormalizes the topic distributions over the terms that remain, without the missing dictionary items. Note, however, that it would be better to include all terms in the corpus.
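To see what that one-line patch does, here is a minimal sketch on toy values. The matrix is made up for illustration; only the renormalization expression is taken from the answer.

```python
import numpy as np

# Toy topic-term rows that no longer sum to 1 (made-up values).
topic_term_dists = np.array([
    [0.4, 0.3, 0.2],
    [0.1, 0.2, 0.3],
])

# Same expression as the suggested patch: divide each row by its own sum.
# sum(axis=1) gives one total per row; [:, None] turns it into a column
# so the division broadcasts across each row.
topic_term_dists = topic_term_dists / topic_term_dists.sum(axis=1)[:, None]

row_sums = topic_term_dists.sum(axis=1)
print(row_sums)
```

After the division every row sums to 1 again, which is the condition the validation check in the traceback complains about. The relative weights of the remaining terms within each topic are unchanged.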
