Hierarchical Dirichlet Process Gensim topic number independent of corpus size


Question

I am using the Gensim HDP module on a set of documents.

>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> hdp = models.HdpModel(corpusA, id2word=dictionaryA)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> len(corpusA)
1113
>>> len(corpusB)
17

Why is the number of topics independent of corpus length?

Recommended answer

@user3907335 is exactly correct here: HDP will calculate as many topics as the assigned truncation level, regardless of how many documents the corpus contains. In gensim the top-level truncation is the T parameter of HdpModel, which defaults to 150, and that is why both corpora above report 150 topics. However, many of these topics may have essentially zero probability of occurring. To help with this in my own work, I wrote a handy little function that performs a rough estimate of the probability weight associated with each topic. Note that this is a rough metric only: it sums the probabilities of the top words shown for each topic and does not account for the probability associated with every word. Even so, it gives a pretty good indication of which topics are meaningful and which aren't:

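import numpy as np
import pandas as pd


def topic_prob_extractor(hdp=None, topn=None):
    """Roughly estimate each topic's weight by summing its top-word probabilities."""
    # Formatted strings look like 'topic 0: 0.004*word + 0.003*other + ...'.
    # This uses the gensim API of the original answer; parameter names and
    # the return format of show_topics differ across gensim versions.
    topic_list = hdp.show_topics(topics=-1, topn=topn)
    # The topic id is the integer between 'topic ' and ':'.
    topics = [int(x.split(':')[0].split(' ')[1]) for x in topic_list]
    split_list = [x.split(' ') for x in topic_list]
    weights = []
    for lst in split_list:
        # Each 'weight*word' token contributes its weight.
        sub_list = [float(entry.split('*')[0]) for entry in lst if '*' in entry]
        weights.append(np.asarray(sub_list))
    sums = [np.sum(x) for x in weights]
    return pd.DataFrame({'topic_id': topics, 'weight': sums})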
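
To make the parsing step concrete, here is a minimal sketch that applies the same string handling to two made-up topic strings, without needing a fitted model. It assumes formatted output of the form 'topic 0: 0.250*word + ...'; the sample strings and the helper name sum_topic_weights are illustrative only and are not gensim output.

```python
def sum_topic_weights(topic_string):
    """Return (topic_id, summed weight) for one formatted topic string."""
    # Split 'topic 0: 0.250*model + ...' into the header and the word list.
    header, _, body = topic_string.partition(':')
    topic_id = int(header.split(' ')[1])
    # Keep only 'weight*word' tokens and take the number before the '*'.
    weights = [float(tok.split('*')[0]) for tok in body.split(' ') if '*' in tok]
    return topic_id, sum(weights)


# Hypothetical strings in the assumed format (not real model output).
samples = [
    "topic 0: 0.250*model + 0.125*topic + 0.125*corpus",
    "topic 1: 0.001*the + 0.001*of",
]
results = [sum_topic_weights(s) for s in samples]
for topic_id, weight in results:
    print(topic_id, weight)
```

A topic whose summed weight is close to zero, like the second sample, is a candidate for one of the "empty" topics that exist only because of the truncation level.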

I assume that you already know how to calculate an HDP model. Once you have an HDP model fitted by gensim, call the function as follows (the second argument, topn, is the number of top words per topic whose probabilities are summed):

topic_weights = topic_prob_extractor(hdp, 500)
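
Once the weights are available, one possible next step is to keep only the topics whose weight clears a cutoff and rank them. The sketch below works on plain (topic_id, weight) pairs; the function name meaningful_topics, the cutoff of 0.01, and the example numbers are all assumptions for illustration, not values from the original answer.

```python
def meaningful_topics(topic_weights, min_weight=0.01):
    """Keep topics with weight >= min_weight, sorted by descending weight."""
    kept = [(tid, w) for tid, w in topic_weights if w >= min_weight]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical (topic_id, weight) pairs; real values come from your model.
example = [(0, 0.42), (1, 0.0003), (2, 0.17), (3, 0.0)]
selected = meaningful_topics(example)
print(selected)
```

With the DataFrame returned by topic_prob_extractor, the pairs could be built with zip(topic_weights['topic_id'], topic_weights['weight']). The right cutoff depends on topn and on the corpus, so it is worth inspecting the sorted weights before fixing one.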
