How come probabilities returned by Gensim LSI method show_topics are negative?


Problem description

(0, '0.707 *" + 0.707 *"quotरोप"; + -0.000 "quot्ध" + -0.000 *"*न्म" +' '0.000 *'बेल्जियम"; + 0.000 *"िंगडम" + 0.000 *"नेपाल; + 0.000 *ऑफ़"; +' '-0.000 युन" + -0.000 "स्थली" *')]
如文档所述
show_topics(num_topics = -1,num_words = 10,log = False,formatted = True)
返回num_topics最重要的主题(默认情况下全部返回). 对于每个主题,请显示num_words个最重要的单词(默认为10个单词).

(0, '0.707*"उत्तरपश्चिमी" + 0.707*"यूरोप" + -0.000"बुद्ध" + -0.000*"जन्म" + ' '0.000*"बेल्जियम" + 0.000*"किंगडम" + 0.000*"नेपाल" + 0.000*"ऑफ़" + ' '-0.000"युन" + -0.000"स्थली"*')]
Whereas the documentation says:
show_topics(num_topics=-1, num_words=10, log=False, formatted=True)
Return num_topics most significant topics (return all by default). For each topic, show num_words most significant words (10 words by default).


The topics are returned as a list – a list of strings if formatted is True, or a list of (word, probability) 2-tuples if False.


If log is True, also output this result to log.
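
For instance, with formatted=False the same call returns the raw (word, value) 2-tuples instead of the formatted strings. A minimal sketch, assuming lsi is the trained LsiModel built in the code below:

# Minimal sketch: `lsi` is the trained LsiModel from the code below.
for topic_id, word_values in lsi.show_topics(num_topics=-1, num_words=10, formatted=False):
    # With formatted=False each topic comes back as (topic_id, [(word, value), ...]);
    # the values themselves can be negative.
    print(topic_id, word_values)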

from pprint import pprint

from gensim import corpora, models
from gensim.parsing.preprocessing import strip_punctuation, strip_short
from nltk.tokenize import word_tokenize

def preprocessing(corpus):
    # `corpus` is an iterable of raw document strings
    for document in corpus:
        doc = strip_short(document, 3)
        doc = strip_punctuation(doc)
        yield word_tokenize(doc)

texts = preprocessing(corpus)
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=1, keep_n=25000)

# Bag-of-words corpus, then TF-IDF weighting
doc_term_matrix = [dictionary.doc2bow(tokens) for tokens in preprocessing(corpus)]
tfidf = models.TfidfModel(doc_term_matrix)
corpus_tfidf = tfidf[doc_term_matrix]

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary)
pprint(lsi.show_topics(num_topics=4, num_words=10))

[(0,
  '0.707*"उत्तरपश्चिमी" + 0.707*"यूरोप" + -0.000*"बुद्ध" + -0.000*"जन्म" + '
  '0.000*"बेल्जियम" + 0.000*"किंगडम" + 0.000*"नेपाल" + 0.000*"ऑफ़" + '
  '-0.000*"युन" + -0.000*"स्थली"'),
 (1,
  '0.577*"किंगडम" + 0.577*"बेल्जियम" + 0.577*"ऑफ़" + -0.000*"जन्म" + '
  '-0.000*"बुद्ध" + -0.000*"भगवान" + -0.000*"स्थित" + -0.000*"लुंबिनी" + '
  '-0.000*"उत्तरपश्चिमी" + -0.000*"यूरोप"'),
 (2,
  '0.354*"जन्म" + 0.354*"भगवान" + 0.354*"स्थित" + 0.354*"स्थली" + 0.354*"युन" '
  '+ 0.354*"बुद्ध" + 0.354*"लुंबिनी" + 0.354*"नेपाल" + 0.000*"उत्तरपश्चिमी" + '
  '0.000*"यूरोप"')]

Recommended answer

Thanks for using SO.


show_topics gives you the most significant topics from the corpus. The values that you see are the contribution of each word towards that topic. For example, "उत्तरपश्चिमी" and "यूरोप" each contribute 0.707, while "बेल्जियम" contributes 0.000 towards defining this topic.


When showing a word's contribution, the model picks the words with the greatest absolute values, but floating-point numbers very close to 0 (say, -0.0000008) are truncated in the display, so they show up as -0.000.
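
To see this concretely, a rough sketch that reuses the lsi model from the question and prints each raw coefficient next to its 3-decimal display form:

# Sketch: `lsi` is the LsiModel trained in the question's code.
# show_topic returns the raw (word, coefficient) pairs for a single topic.
for word, coeff in lsi.show_topic(0, topn=10):
    # A coefficient such as -8e-07 is negative but rounds to -0.000 at three
    # decimals, which is what the formatted show_topics output displays.
    print(f'{word}: raw={coeff:.10f} displayed={coeff:.3f}')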

Reference: https://radimrehurek.com/gensim/models/lsimodel.html

