Understanding parameters in Gensim LDA Model

Question
I am using gensim.models.ldamodel.LdaModel to perform LDA, but I do not understand some of the parameters and cannot find explanations in the documentation. If someone has experience working with this, I would love further details of what these parameters signify.

Specifically, I do not understand:
- random_state
- update_every
- chunksize
- passes
- alpha
- per_word_topics
I am working with a corpus of 500 documents which are roughly 3-5 pages each (I unfortunately cannot share a snapshot of the data for confidentiality reasons). Currently I have set
- num_topics = 10
- random_state = 100
- update_every = 1
- chunksize = 50
- passes = 10
- alpha = 'auto'
- per_word_topics = True
but this is solely based on an example I saw, and I am not sure how well it generalizes to my data.
Answer
I wonder if you have seen this page?
Either way, let me explain a few things for you. The number of documents you use is small for this method (it works much better when trained on a data source the size of Wikipedia). The results will therefore be rather crude, and you have to be aware of that. This is also why you should not aim for a large number of topics (you chose 10, which could perhaps sensibly go up to 20 in your case).
As for the other parameters:

- random_state - this serves as a seed (in case you wanted to repeat the training process exactly)
- chunksize - the number of documents to consider at once (affects memory consumption)
- update_every - update the model every update_every chunksize chunks (essentially, this is for memory consumption optimization)
- passes - how many times the algorithm is supposed to pass over the whole corpus
- alpha - quoting the documentation: "can be set to an explicit array = prior of your choice. It also supports special values of 'asymmetric' and 'auto': the former uses a fixed normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric prior directly from your data."
- per_word_topics - setting this to True allows extraction of the most likely topics given a word. The training process is set up so that every word is assigned to a topic; otherwise, words that are not indicative are omitted. phi_value is another parameter that steers this process - it is a threshold for whether a word is treated as indicative or not.
The optimal training process parameters are described in particular detail in M. Hoffman et al., "Online Learning for Latent Dirichlet Allocation".
For memory optimization of the training process or the model see this blog post.