MALLET topic modeling: topic keys output parameter

Question

In MALLET topic modeling, the --output-topic-keys [FILENAME] option prints a parameter beside each topic. The tutorial on the MALLET site calls it the "Dirichlet parameter" of the topic.

What does this parameter represent? Is it β in the LDA model? If not, what is it, and what are its meaning and use?

I noticed that when I generate the topic model without the parameter optimization option, this parameter differs between version 2.0.7 and version 2.0.8. Why does this difference occur?

Here is the version 2.0.7 output:

And version 2.0.8:

I know that the output differs on each run, but I am only concerned with this parameter.

Recommended answer

The topic model inference algorithm used in MALLET repeatedly samples a new topic assignment for each word while holding the assignments of all other words fixed. Two factors control this process: (1) how often the current word type appears in each topic and (2) how many times each topic appears in the current document. Smoothing parameters ensure that neither value is ever zero for any topic: beta for the first factor, alpha for the second.
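To make the two factors concrete, here is a minimal sketch of the textbook collapsed Gibbs sampling weight for a single token. It is not MALLET's actual code (MALLET uses an optimized sampler); the function names, argument names, and counts below are hypothetical and chosen only for illustration.

```python
import random


def topic_weights(word_topic_counts, topic_totals, doc_topic_counts,
                  alpha, beta, vocab_size):
    """Unnormalized sampling weight of each topic for one token.

    All counts are assumed to exclude the token being resampled.
    word_topic_counts[k]: times the current word type is assigned to topic k
    topic_totals[k]:      total tokens assigned to topic k
    doc_topic_counts[k]:  tokens in the current document assigned to topic k
    alpha[k]:             per-topic document smoothing
    beta:                 word smoothing (a single value here, by assumption)
    """
    weights = []
    for k in range(len(topic_totals)):
        # Factor 1: how often this word type appears in topic k, smoothed by beta.
        word_factor = (word_topic_counts[k] + beta) / (topic_totals[k] + vocab_size * beta)
        # Factor 2: how often topic k appears in this document, smoothed by alpha.
        doc_factor = doc_topic_counts[k] + alpha[k]
        weights.append(word_factor * doc_factor)
    return weights


def sample_topic(weights, rng=random):
    """Draw a topic index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

Because alpha and beta are strictly positive, every topic keeps a nonzero weight even when both of its counts are zero, which is the smoothing role described above.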

The parameter displayed here is alpha. You can think of it as the number of "imaginary" words added to each topic in every document. In the first case, topic 0 has 2.5 imaginary words of weight in every document. The default value for this parameter was originally 50 / numTopics. Larger values encourage more uniform topic distributions within documents; smaller values encourage more sparsity. The general experience was that 50 was too large and that 5 is a better default, so the default was changed in 2.0.8.
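The arithmetic can be sketched as follows. The topic count of 20 is an assumption inferred from the 2.5 value quoted above (50 / 20 = 2.5); the helper names and the example document counts are hypothetical.

```python
def per_topic_alpha(alpha_sum, num_topics):
    """Symmetric per-topic alpha when a total alpha is split evenly."""
    return alpha_sum / num_topics


def smoothed_doc_topics(doc_topic_counts, alpha):
    """Document-topic proportions after adding alpha imaginary words per topic."""
    total = sum(doc_topic_counts) + alpha * len(doc_topic_counts)
    return [(n + alpha) / total for n in doc_topic_counts]


num_topics = 20                             # assumed, not stated in the question
old_alpha = per_topic_alpha(50, num_topics)  # 2.5, the older default
new_alpha = per_topic_alpha(5, num_topics)   # 0.25, the newer default

# A hypothetical 10-token document: 8 tokens in topic 0, 2 in topic 1.
counts = [8, 2] + [0] * (num_topics - 2)
old_proportions = smoothed_doc_topics(counts, old_alpha)
new_proportions = smoothed_doc_topics(counts, new_alpha)
```

With the larger alpha, the 50 imaginary words outweigh the 10 real ones and the proportions flatten toward uniform; with the smaller alpha, topic 0 keeps most of the document's mass.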

By default, the alpha weight is equal for all topics. With hyperparameter optimization on, these values can vary. Usually a topic with a large value contains "near stopwords" that are frequent in most documents and carry little content. Topics with very small values often correspond to unusual, distinctive documents. Topics in the middle are often the most interesting.
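To inspect the values after an optimized run, one option is to sort the topic keys file by its alpha column. The sketch below assumes each line holds three tab-separated fields (topic id, alpha, space-separated top words); verify that against your own output file. The function names and sample lines are made up for illustration.

```python
def parse_topic_keys(lines):
    """Parse lines assumed to look like '<topic id>\\t<alpha>\\t<word word ...>'."""
    topics = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        topic_id, alpha, words = line.split("\t", 2)
        topics.append((int(topic_id), float(alpha), words.split()))
    return topics


def sort_by_alpha(topics):
    """Order topics from largest alpha (near stopwords) to smallest (rare, distinctive)."""
    return sorted(topics, key=lambda topic: topic[1], reverse=True)


# Hypothetical sample lines, not real MALLET output.
sample = [
    "0\t0.03121\tharbor vessel tide crew",
    "1\t0.85410\tsaid year time people",
    "2\t0.20977\tcourt judge trial case",
]
ranked = sort_by_alpha(parse_topic_keys(sample))
```

In practice you would pass an open file handle for the file given to --output-topic-keys in place of the sample list.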
