How to tune the parameters for gensim `LdaMulticore` in Python


Question

I was running the gensim LdaMulticore package for topic modelling in Python. I tried to understand the meaning of the parameters within LdaMulticore and found a website that provides some explanation of how they are used. As a non-expert, I have some difficulty understanding these intuitively. I also referred to some other materials from the web, but I guess this page gives relatively full explanations of every parameter:
This page

1. chunksize - Number of documents to be used in each training chunk.
-> Does it mean that it determines how many documents are analyzed (trained) at once?
Does changing the chunksize generate significantly different outcomes, or does it only affect the running time?

2. alpha, eta, decay
-> I kept reading the explanations but couldn't understand these at all.
Could someone give me some intuitive explanation of what these are about and when I need to adjust them?

3. iterations - Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.
-> It seems that Python goes over the entire corpus n times when I set it to n. So the higher the number, the more data is analyzed, but it also takes a longer time.

4. random_state - Either a randomState object or a seed to generate one. Useful for reproducibility.
-> I've seen people set this by putting in a random number. But what is random state about?

Answer

I am wondering if you saw this answer? There I provide some explanation regarding chunksize and alpha. This blog post has practical tips and can be of help too.

In short: chunksize - how many documents are loaded into memory while the "expectation" step is calculated, before the model is updated. Each "expectation" step of the Expectation-Maximization algorithm takes this number of documents into account at once and updates the matrix only after it finishes the calculation on the "chunk". The size of the chunk determines the performance of the process: the more documents in memory at once, the better. Overly small chunks also impact numerical accuracy, particularly for a very large number of documents.
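As a minimal sketch, this is how the parameter is typically passed; the toy corpus and the `chunksize`/`workers` values below are illustrative assumptions, not recommendations:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# Tiny illustrative corpus; real corpora would be far larger.
docs = [["human", "interface", "computer"],
        ["survey", "user", "computer", "system"],
        ["eps", "user", "interface", "system"],
        ["system", "human", "system", "eps"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    chunksize=2000,  # documents held in memory per "expectation" chunk
    workers=2,       # worker processes for the parallel E-step
)
```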

alpha, eta, decay - these are strictly linked to the LDA algorithm, and there is no "intuitive explanation" unless you have a grasp of the algorithm itself, which requires some understanding of Bayesian methods, Expectation-Maximization in particular.
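For reference, a hedged sketch of where these priors plug in; the values are placeholders, not tuned settings (note that `LdaMulticore` does not accept `alpha='auto'`, unlike the single-core `LdaModel`):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

docs = [["cat", "dog", "fish"], ["dog", "bird", "cat"], ["fish", "bird", "dog"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    alpha="symmetric",  # Dirichlet prior over per-document topic weights
    eta=0.01,           # Dirichlet prior over per-topic word weights
    decay=0.5,          # how fast old chunks are forgotten in online updates
)
```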

iterations - you are not correct. The higher the number, the more times the algorithm goes through the whole set of documents. So there is no "more data": it is only the corpus you provide, just iterated over more times.
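To make the repetition knobs concrete, a sketch under the assumption of a toy corpus; gensim also exposes a separate `passes` parameter, and the split between the two is roughly as commented:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

docs = [["apple", "banana"], ["banana", "cherry"], ["apple", "cherry"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    iterations=100,  # upper bound on inference repetitions per document
    passes=5,        # full sweeps over the same corpus; no new data involved
)
```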

random_state - this serves as a seed: if you want to repeat the training process exactly, it is enough to set the seed to the same value, and you will get the same model for the same data and other parameters. This is useful when you care about reproducibility.
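A small reproducibility check, assuming a toy corpus and `workers=1` (with more workers, multiprocessing scheduling can in principle still introduce variation):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

docs = [["red", "green", "blue"], ["green", "blue", "yellow"], ["red", "yellow"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

def train():
    # Same data, same parameters, same seed -> same model.
    return LdaMulticore(corpus=corpus, id2word=dictionary,
                        num_topics=2, workers=1, random_state=42)

m1, m2 = train(), train()
print(m1.print_topics() == m2.print_topics())  # expected: True
```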

