Is a coherence score of 0.4 good or bad?


Problem Description

I need to know whether a coherence score of 0.4 is good or bad. I use LDA as the topic modelling algorithm.

What is the average coherence score in this context?

Recommended Answer

Coherence measures the relative distance between words within a topic. There are two major types: C_V, typically 0 < x < 1, and uMass, typically -14 < x < 14. It's rare to see a coherence of 1 or above .9 unless the words being measured are identical words or bigrams; "United" and "States" would likely return a coherence score of ~.94, while "hero" and "hero" would return a coherence of 1. The overall coherence score of a topic is the average of the distances between words. I try to attain a .7 in my LDAs when using c_v, which I consider a strong topic correlation (a minimal sketch of computing both score types follows the list below). I would say:


  • .3 is bad

  • .4 is low

  • .55 is okay

  • .65 might be as good as it is going to get

  • .7 is good

  • .8 is unlikely, and

  • .9 is probably wrong
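
For reference, both score types come straight out of gensim's CoherenceModel. Here is a minimal sketch of scoring one already-trained model both ways; it assumes you already have the usual lda_model, texts, dictionary and corpus objects from a standard gensim preprocessing pipeline:

from gensim.models import CoherenceModel

# c_v: sliding-window based, typically lands in 0 < x < 1
cv = CoherenceModel(model=lda_model, texts=texts,
                    dictionary=dictionary, coherence='c_v')
print('c_v coherence:', cv.get_coherence())

# u_mass: document co-occurrence based, roughly -14 < x < 14
umass = CoherenceModel(model=lda_model, corpus=corpus,
                       dictionary=dictionary, coherence='u_mass')
print('u_mass coherence:', umass.get_coherence())

The two numbers are on different scales, so only compare models scored with the same coherence type.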

Fixes for low coherence:


  • adjust your parameters: alpha = .1, beta = .01 or .001, random_state = 123, etc. (a sketch of setting these in gensim follows the coherence-check code below)

  • get better data

  • at .4 you probably have the wrong number of topics; check out https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/ for what is known as the elbow method. It gives you a graph of the optimal number of topics for greatest coherence in your data set. I'm using Mallet, which has pretty good coherence. Here is code to check coherence for different numbers of topics:

import gensim
import matplotlib.pyplot as plt
from gensim.models import CoherenceModel
from pprint import pprint

# Path to the Mallet binary; adjust for your installation.
mallet_path = 'path/to/mallet'

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various numbers of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # Train a Mallet LDA model for this topic count
        # (gensim.models.wrappers.LdaMallet is available in gensim < 4.0).
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

# Can take a long time to run.
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=2, limit=40, step=6)

# Show graph
limit = 40; start = 2; step = 6
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')  # a list of labels, not a bare string
plt.show()

# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

# Select the model and print the topics
optimal_model = model_list[3]
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))
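
On the first fix above (tuning alpha and beta): in gensim's own LdaModel the topic-word prior called beta in the answer is named eta, and random_state fixes the seed. Below is a hedged sketch of retraining with the values suggested in the answer; num_topics=20 and passes=10 are placeholder assumptions, not recommendations:

from gensim.models import LdaModel

tuned_model = LdaModel(corpus=corpus,
                       id2word=id2word,
                       num_topics=20,     # placeholder; pick via the elbow plot above
                       alpha=0.1,         # document-topic prior
                       eta=0.01,          # topic-word prior ("beta"); try 0.001 too
                       random_state=123,  # reproducible runs
                       passes=10)         # assumed number of training passes

You can then re-score tuned_model with the same CoherenceModel call shown earlier to see whether coherence improved.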

I hope this helps :)
