LDA: Why sampling for inference of a new document?


Question

Given a standard LDA model with a few thousand topics and a few million documents, trained with Mallet's collapsed Gibbs sampler:

When inferring a new document, why not just skip sampling and simply use the model's term-topic counts to determine the topic assignments of the new document? I understand that applying Gibbs sampling to the new document takes the new document's topic mixture into account, which in turn influences how topics are composed (beta, the term-frequency distributions). However, since the topics are kept fixed when inferring a new document, I don't see why this should be relevant.
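For concreteness, here is a minimal sketch (in Python) of the deterministic approach proposed here: each word is mapped to the topic under which it is most probable, using only the trained term-topic counts. All names (deterministic_assignments, term_topic_counts) are illustrative, not Mallet's API, and beta is assumed to be a symmetric topic-word smoothing prior.

```python
import numpy as np

def deterministic_assignments(doc_word_ids, term_topic_counts, beta=0.01):
    """Assign each word of a new document to its most probable topic
    using only the trained term-topic counts, with no sampling.

    doc_word_ids:      vocabulary ids of the new document's tokens
    term_topic_counts: (V, K) count matrix from the trained model
    """
    # phi[w, k] approximates P(word = w | topic = k), smoothed by beta.
    phi = term_topic_counts + beta
    phi = phi / phi.sum(axis=0, keepdims=True)
    # Every word is assigned independently of every other word in the
    # document; the document's own topic mixture plays no role here.
    return [int(np.argmax(phi[w])) for w in doc_word_ids]
```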

One issue with sampling is its probabilistic nature: the topic assignments inferred for a document sometimes vary greatly across repeated invocations. I would therefore like to understand the theoretical and practical value of sampling versus just using a deterministic method.

Thanks, Ben

Answer

Just using the term-topic counts of the last Gibbs sample is not a good idea. Such an approach doesn't take the topic structure into account: if a document has many words from one topic, it's likely to have even more words from that topic [1].

For example, say two words have equal probabilities in two topics. The topic assignment of the first word in a given document affects the topic probability of the other word: the other word is more likely to be in the same topic as the first one. The relation also works the other way. The complexity of this situation is why we use methods like Gibbs sampling to estimate values for this sort of problem.
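To make that coupling concrete, here is a minimal sketch of collapsed Gibbs inference for a single held-out document with the topics held fixed. phi is assumed to be the smoothed (V, K) topic-word matrix from the trained model (no all-zero rows) and alpha the document-topic prior; the names are illustrative, not Mallet's API.

```python
import numpy as np

def gibbs_infer(doc_word_ids, phi, alpha=0.1, n_iters=100, rng=None):
    """Gibbs sampling over the topic assignments of one new document,
    with the trained topic-word distributions phi held fixed."""
    rng = rng or np.random.default_rng()
    n_topics = phi.shape[1]
    # Random initial assignments and per-document topic counts.
    z = rng.integers(n_topics, size=len(doc_word_ids))
    doc_topic = np.bincount(z, minlength=n_topics)
    for _ in range(n_iters):
        for i, w in enumerate(doc_word_ids):
            doc_topic[z[i]] -= 1  # take word i out of the counts
            # P(z_i = k | rest) is proportional to phi[w, k] * (n_dk + alpha).
            # The (n_dk + alpha) factor is where the rest of the document
            # influences this word's topic: exactly the coupling that a
            # per-word term-topic lookup ignores.
            p = phi[w] * (doc_topic + alpha)
            z[i] = rng.choice(n_topics, p=p / p.sum())
            doc_topic[z[i]] += 1
    return z, doc_topic
```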

As for your comment on topic assignments varying, that can't be helped, and it could be taken as a good thing: if a word's topic assignment varies, you can't rely on it. What you're seeing is that the posterior distribution over topics for that word has no clear winner, so you should take any particular assignment with a grain of salt :) (A practical remedy, sketched below, is to average over many samples rather than trusting a single one.)

[1] Assuming the prior on document-topic distributions encourages sparsity, as is usually chosen for topic models.
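Finally, a sketch of the averaging just mentioned, reusing the illustrative gibbs_infer above: estimating the document's topic mixture by averaging the document-topic counts of several independent chains gives a far more stable result than any single sample's assignments.

```python
import numpy as np

def mean_topic_mixture(doc_word_ids, phi, alpha=0.1,
                       n_chains=10, n_iters=100):
    """Average document-topic counts over independent Gibbs chains
    to get a stable point estimate of the topic mixture theta_d."""
    totals = np.zeros(phi.shape[1])
    for seed in range(n_chains):
        rng = np.random.default_rng(seed)
        _, doc_topic = gibbs_infer(doc_word_ids, phi, alpha,
                                   n_iters=n_iters, rng=rng)
        totals += doc_topic + alpha
    # Normalizing the pooled counts averages out run-to-run variation
    # in individual word assignments.
    return totals / totals.sum()
```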

