Hierarchical Dirichlet Process Gensim topic number independent of corpus size


Problem Description


I am using the Gensim HDP module on a set of documents.

>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> hdp = models.HdpModel(corpusA, id2word=dictionaryA)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> len(corpusA)
1113
>>> len(corpusB)
17

Why is the number of topics independent of corpus length?

Solution

@user3907335 is exactly correct here: HDP will compute as many topics as the assigned truncation level, regardless of corpus size (see the sketch after the function below). However, it may be the case that many of these topics have essentially zero probability of occurring. To help with this in my own work, I wrote a handy little function that makes a rough estimate of the probability weight associated with each topic. Note that this is a rough metric only: it does not account for the probability associated with each word. Even so, it provides a pretty good measure of which topics are meaningful and which aren't:

import pandas as pd
import numpy as np

def topic_prob_extractor(hdp=None, topn=None):
    # Each formatted topic string looks like "topic 0: 0.008*word + ...".
    topic_list = hdp.show_topics(topics=-1, topn=topn)
    # Recover the integer topic id from the "topic N:" prefix.
    topics = [int(x.split(':')[0].split(' ')[1]) for x in topic_list]
    split_list = [x.split(' ') for x in topic_list]
    weights = []
    for lst in split_list:
        sub_list = []
        for entry in lst:
            # Tokens of the form "0.008*word" carry the per-word weights.
            if '*' in entry:
                sub_list.append(float(entry.split('*')[0]))
        weights.append(np.asarray(sub_list))
    # Sum each topic's top-n word weights as a rough importance score.
    sums = [np.sum(x) for x in weights]
    return pd.DataFrame({'topic_id': topics, 'weight': sums})
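
Where does the fixed count of 150 come from? It is gensim's default top-level truncation, exposed as the T parameter of HdpModel. A minimal sketch of capping it (the toy corpus below is illustrative only, not the asker's data):

from gensim import corpora, models

# Toy documents purely to make the sketch self-contained.
docs = [['human', 'interface', 'computer'],
        ['survey', 'user', 'computer', 'system', 'response', 'time'],
        ['graph', 'minors', 'trees', 'survey']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# T is the top-level truncation: the maximum number of topics the model
# will ever instantiate (default 150, hence the 150 reported for both
# corpora in the question). Lowering T caps the reported topic count.
hdp = models.HdpModel(corpus, id2word=dictionary, T=50)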

I assume that you already know how to calculate an HDP model. Once you have an hdp model calculated by gensim, you can call the function as follows:

topic_weights = topic_prob_extractor(hdp, 500)
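
The returned DataFrame can then be filtered down to the topics that actually carry weight. A hypothetical follow-up (the 0.25 cutoff is an arbitrary illustration; tune it for your own data):

# Keep only topics whose summed top-word weight is non-negligible; the
# 0.25 threshold is an arbitrary illustration, not a recommended value.
meaningful = topic_weights[topic_weights['weight'] > 0.25]
print(meaningful.sort_values('weight', ascending=False))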
