Understanding parameters in Gensim LDA Model


Problem description

I am using gensim.models.ldamodel.LdaModel to perform LDA, but I do not understand some of the parameters and cannot find explanations in the documentation. If someone has experience working with this, I would love further details of what these parameters signify. Specifically, I do not understand:

  • random_state
  • update_every
  • chunksize
  • passes
  • alpha
  • per_word_topics

I am working with a corpus of 500 documents, each roughly 3-5 pages long (unfortunately I cannot share a snapshot of the data for confidentiality reasons). Currently I have set

  • num_topics = 10
  • random_state = 100
  • update_every = 1
  • chunksize = 50
  • passes = 10
  • alpha = 'auto'
  • per_word_topics = True

but this is based solely on an example I saw, and I am not sure how generalizable it is to my data.
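For reference, here is a minimal sketch of how these settings plug into gensim.models.ldamodel.LdaModel (assuming docs is a list of already-tokenized documents; the preprocessing itself is hypothetical and not part of the question):

```python
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel

# docs: list of tokenized documents, e.g. [["topic", "model", ...], ...]
dictionary = Dictionary(docs)                       # token -> integer id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,         # number of latent topics to extract
    random_state=100,      # fixed seed, so repeated runs give identical results
    update_every=1,        # update the model after every chunk
    chunksize=50,          # documents processed per training chunk
    passes=10,             # full sweeps over the whole corpus
    alpha='auto',          # learn an asymmetric document-topic prior from the data
    per_word_topics=True,  # also track the most likely topics per word
)
```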

Recommended answer

I wonder if you have seen this page?

Either way, let me explain a few things for you. The number of documents you use is small for this method (it works much better when trained on a data source the size of Wikipedia). The results will therefore be rather crude, and you have to be aware of that. This is also why you should not aim for a large number of topics (you chose 10, which could perhaps sensibly go up to 20 in your case).
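As an aside (not part of the original answer): one common way to compare candidate topic counts, such as 10 vs. 20, is topic coherence. A rough sketch using gensim's CoherenceModel, reusing the docs, dictionary and corpus assumed in the question's setup above:

```python
from gensim.models import CoherenceModel
from gensim.models.ldamodel import LdaModel

# Train one model per candidate topic count and compare c_v coherence;
# higher coherence is generally (though not always) better.
for k in (10, 15, 20):
    lda_k = LdaModel(corpus=corpus, id2word=dictionary,
                     num_topics=k, random_state=100, passes=10)
    cm = CoherenceModel(model=lda_k, texts=docs,
                        dictionary=dictionary, coherence='c_v')
    print(k, cm.get_coherence())
```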

The other parameters:

  • random_state - this serves as a seed (in case you want to repeat the training process exactly)

  • chunksize - number of documents to consider at once (affects memory consumption)

  • update_every - the model is updated once every update_every chunks of chunksize documents each (essentially, this is a memory-consumption optimization; the sketch at the end of this answer works through the arithmetic)

  • passes - how many times the algorithm is supposed to pass over the whole corpus

  • alpha - quoting the documentation:

    can be set to an explicit array = prior of your choice. It also supports special values of 'asymmetric' and 'auto': the former uses a fixed normalized asymmetric 1.0/topicno prior, the latter learns an asymmetric prior directly from your data.

  • per_word_topics - setting this to True allows extraction of the most likely topics for a given word. The training process is set up so that every word is assigned to a topic; otherwise, words that are not indicative are omitted. phi_value is another parameter that steers this process - it is the threshold for whether a word is treated as indicative or not (a usage sketch follows at the end of this answer).

    The optimal training-process parameters are described in particular detail in M. Hoffman et al., Online Learning for Latent Dirichlet Allocation.

    For memory optimization of the training process or the model see this blog post.
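To make the interplay of these parameters concrete, here is a rough sketch (lda, corpus and dictionary are assumed from the setup in the question; the numbers are just the question's settings, not recommendations):

```python
import numpy as np
from gensim.models.ldamodel import LdaModel

# With 500 documents, chunksize=50, update_every=1 and passes=10:
# 500 / 50 = 10 chunks per pass, one update per chunk,
# so roughly 10 * 10 = 100 online updates over the whole training run.

# alpha as an explicit array instead of 'auto': one prior weight per topic.
lda_fixed = LdaModel(corpus=corpus, id2word=dictionary,
                     num_topics=10, random_state=100,
                     alpha=np.full(10, 0.1))  # explicit symmetric prior

# per_word_topics=True makes gensim return per-word topic assignments
# alongside the document's topic distribution; minimum_phi_value is the
# phi threshold mentioned above.
bow = corpus[0]
doc_topics, word_topics, word_phis = lda.get_document_topics(
    bow, per_word_topics=True, minimum_phi_value=0.01)
# word_topics: [(word_id, [topic ids, most likely first]), ...]
# word_phis:   [(word_id, [(topic_id, phi_value), ...]), ...]
```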
