Evaluation of topic modeling: How to understand a coherence value / c_v of 0.4, is it good or bad?

Views: 35 · Published: 2022/3/2 9:55:23 · Tags: data-science, lda, topic-modeling

Question

I would like to know whether a coherence score of 0.4 is good or bad. I use LDA as the topic modeling algorithm.

What is the average coherence score in this context?

Recommended answer

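Coherence measures the relative distance between words within a topic. There are two major types: c_v, typically 0 < x < 1, and UMass, typically -14 < x < 14. It is rare to see a coherence of 1 or above .9 unless the words being measured are identical words or bigrams. For example, United and States would likely return a coherence score of ~.94, and hero and hero would return a coherence of 1. The overall coherence score of a topic is the average of the distances between words. When I use c_v, I try to attain .7 in my LDA models, which I consider a strong topic correlation. I would say: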

.3 is bad

.4 is low

.55 is okay

.65 might be as good as it is going to get

.7 is nice

.8 is unlikely and

.9 is probably wrong
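The idea above, that a topic's score is the average of word-to-word scores, can be sketched in plain Python. This is a simplified stand-in and not gensim's c_v implementation: it assumes normalized PMI (NPMI) over document-level co-occurrence as the pairwise score, and the function names `npmi` and `topic_coherence` and the toy corpus are made up for illustration. Its values are therefore not comparable to the c_v scale listed above.

```python
from itertools import combinations
from math import log


def npmi(w1, w2, docs):
    """Normalized PMI of two words, estimated from document co-occurrence."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0:
        return -1.0  # the words never appear together
    if p12 == 1:
        return 1.0   # both words appear in every document
    return log(p12 / (p1 * p2)) / -log(p12)


def topic_coherence(top_words, docs):
    """Average the pairwise NPMI scores over all word pairs in a topic."""
    pairs = list(combinations(top_words, 2))
    if not pairs:
        raise ValueError("need at least two words to score a topic")
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)


# Toy corpus (made up for illustration): each document is a set of tokens.
toy_docs = [
    {"united", "states", "president"},
    {"united", "states", "senate"},
    {"apple", "banana", "fruit"},
    {"apple", "pie", "fruit"},
]

# Two words that always co-occur score at the top of the NPMI range.
print(npmi("united", "states", toy_docs))
# Mixing in an unrelated word pulls the topic average down.
print(topic_coherence(["united", "states", "fruit"], toy_docs))
```

Words that always appear together land at the top of the range, which mirrors the remark that near-perfect scores mostly come from identical words or bigrams.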

Low coherence fixes:

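Tune your parameters: alpha=.1, beta=.01 or .001, random_state=123, etc.
Get better data.
At .4 you probably have the wrong number of topics. Check out https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/ for what is known as the elbow method: it gives you a graph of the optimal number of topics for the greatest coherence in your data set. I'm using Mallet, which gives pretty good coherence; here is code to check coherence for different numbers of topics: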

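import gensim
from gensim.models import CoherenceModel
import matplotlib.pyplot as plt
from pprint import pprint

# Assumes gensim < 4.0 (the LdaMallet wrapper was removed in 4.0) and that
# mallet_path is already set to the location of your MALLET binary.
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):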
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # Use the dictionary passed in as a parameter rather than a global id2word.
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

# Can take a long time to run.
limit = 40; start = 2; step = 6
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=start, limit=limit, step=step)
# Show graph
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')  # a list, so the whole string is one label
plt.show()

# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))
    
# Select the model and print the topics.
# Index 3 suited the author's data; pick the index at the elbow of your own plot.
optimal_model = model_list[3]
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))
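Reading the elbow off the plot by eye can also be approximated in code. The sketch below is one possible heuristic and not part of the original answer: it assumes the topic counts are in ascending order and returns the smallest count whose coherence is within a tolerance of the best score. The function name `pick_num_topics`, the `tolerance` parameter, and the sample scores are hypothetical.

```python
def pick_num_topics(topic_counts, coherence_values, tolerance=0.01):
    """Return the smallest topic count whose coherence is within
    `tolerance` of the best observed score.

    Assumes topic_counts is sorted in ascending order and lines up
    one-to-one with coherence_values.
    """
    if not coherence_values or len(topic_counts) != len(coherence_values):
        raise ValueError("topic_counts and coherence_values must be non-empty and equal in length")
    best = max(coherence_values)
    for k, cv in zip(topic_counts, coherence_values):
        if cv >= best - tolerance:
            return k


# Made-up scores for illustration only; substitute the values you computed.
sample_counts = list(range(2, 40, 6))   # 2, 8, 14, 20, 26, 32, 38
sample_scores = [0.31, 0.42, 0.49, 0.53, 0.535, 0.528, 0.51]

chosen = pick_num_topics(sample_counts, sample_scores)
print("Chosen number of topics:", chosen)
```

Preferring the smallest count near the peak keeps the model simpler when several topic counts score almost the same; a larger tolerance pushes the choice toward fewer topics.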

I hope this helps :)
