Understanding parameters in Gensim LDA Model

Question
I am using gensim.models.ldamodel.LdaModel to perform LDA, but I do not understand some of the parameters and cannot find explanations in the documentation. If someone has experience working with this, I would love further details of what these parameters signify.

Specifically, I do not understand:
- random_state
- update_every
- chunksize
- passes
- alpha
- per_word_topics
I am working with a corpus of 500 documents which are roughly 3-5 pages each (I unfortunately cannot share a snapshot of the data for confidentiality reasons). Currently I have set
- num_topics = 10
- random_state = 100
- update_every = 1
- chunksize = 50
- passes = 10
- alpha = 'auto'
- per_word_topics = True
but this is solely based on an example I saw, and I am not sure how well it generalizes to my data.
Answer
I wonder if you have seen this page?
Either way, let me explain a few things for you. The number of documents you use is small for this method (it works much better when trained on a data source the size of Wikipedia). The results will therefore be rather crude, and you have to be aware of that. This is also why you should not aim for a large number of topics (you chose 10, which could perhaps sensibly go up to 20 in your case).
As for the other parameters:

- random_state - this serves as a seed (in case you wanted to repeat the training process exactly)
- chunksize - the number of documents to consider at once (affects memory consumption)
- update_every - update the model every update_every chunksize chunks (essentially, this is for memory consumption optimization)
- passes - how many times the algorithm is supposed to pass over the whole corpus
- alpha - quoting the documentation: "can be set to an explicit array = prior of your choice. It also supports special values of 'asymmetric' and 'auto': the former uses a fixed normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric prior directly from your data."
- per_word_topics - setting this to True allows extraction of the most likely topics given a word. The training process is set up so that every word is assigned to a topic; otherwise, words that are not indicative are omitted. phi_value is another parameter that steers this process - it is a threshold for whether a word is treated as indicative or not.
The optimal training process parameters are described in particular detail in M. Hoffman et al., "Online Learning for Latent Dirichlet Allocation".
For memory optimization of the training process or the model see this blog post.