Spark MLlib LDA: how to infer the topic distribution of a new, unseen document?

Question

I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to then use the model to find the topic distribution of a new, unseen document.

Recommended answer

As of Spark 1.5, this functionality has not been implemented for DistributedLDAModel. What you need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training) documents, something like this:

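// (document id, term-count vector) pairs for the unseen documents
val newDocuments: RDD[(Long, Vector)] = ...
// distLDA is the trained DistributedLDAModel
val topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)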
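For context, the snippet above can be expanded into a fuller sketch. The toy corpus, the 5-word vocabulary, the object name, and the local master setting below are all assumptions made for illustration; only toLocal and topicDistributions come from the answer itself. It assumes Spark 1.5 or later with the RDD-based MLlib API on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical driver object; the name and the data are illustrative only.
object InferUnseenDocs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lda-infer").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Made-up term-count vectors over an assumed 5-word vocabulary.
    val training: RDD[(Long, Vector)] = sc.parallelize(Seq(
      (0L, Vectors.dense(2.0, 1.0, 0.0, 0.0, 1.0)),
      (1L, Vectors.dense(0.0, 0.0, 3.0, 1.0, 0.0)),
      (2L, Vectors.dense(1.0, 2.0, 0.0, 0.0, 0.0)),
      (3L, Vectors.dense(0.0, 1.0, 2.0, 2.0, 0.0))
    ))

    // The default optimizer is EM, so run() returns a DistributedLDAModel.
    val distLDA = new LDA()
      .setK(2)
      .setMaxIterations(20)
      .run(training)
      .asInstanceOf[DistributedLDAModel]

    // Unseen documents must use the same vocabulary indexing as training.
    val newDocuments: RDD[(Long, Vector)] = sc.parallelize(Seq(
      (100L, Vectors.dense(1.0, 1.0, 0.0, 0.0, 0.0)),
      (101L, Vectors.dense(0.0, 0.0, 1.0, 2.0, 0.0))
    ))

    // Convert to a LocalLDAModel, then infer per-document topic mixtures.
    val topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)

    topicDistributions.collect().foreach { case (id, dist) =>
      println(s"doc $id -> $dist")
    }

    sc.stop()
  }
}
```

Each returned vector has one entry per topic. The key practical constraint is that new documents are vectorized with the same vocabulary (same term-to-index mapping) that was used for training.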

This is going to be less accurate than the EM algorithm that this paper (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.3584&rep=rep1&type=pdf) suggests, but it will work. Alternatively, you could just use the new online variational EM training algorithm, which already results in a LocalLDAModel. In addition to being faster, this new algorithm is also preferable because, unlike the older EM algorithm for fitting DistributedLDAModels, it optimizes the parameters (alphas) of the Dirichlet prior over the topic mixing weights for the documents. According to Wallach et al., optimization of the alphas is pretty important for obtaining good topics.
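The online alternative could look roughly like the sketch below. The object and function names are hypothetical, and the explicit setOptimizeDocConcentration(true) call is an assumption about how alpha optimization is switched on (it is believed to be opt-in in the RDD-based API as of Spark 1.5); check the API docs for the version you run.

```scala
import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper object; not part of Spark.
object OnlineLdaSketch {

  // Trains with the online variational optimizer, which yields a
  // LocalLDAModel directly, so no toLocal conversion is needed.
  def trainOnline(corpus: RDD[(Long, Vector)], k: Int): LocalLDAModel = {
    val optimizer = new OnlineLDAOptimizer()
      // Assumption: alpha (docConcentration) optimization is enabled explicitly.
      .setOptimizeDocConcentration(true)

    new LDA()
      .setK(k)
      .setOptimizer(optimizer)
      .run(corpus)
      .asInstanceOf[LocalLDAModel]
  }

  // Infers topic mixtures for documents that were not part of training.
  def infer(model: LocalLDAModel,
            newDocuments: RDD[(Long, Vector)]): RDD[(Long, Vector)] =
    model.topicDistributions(newDocuments)
}
```

With a model trained this way, model.docConcentration exposes the learned alphas, which can be inspected to see whether the optimization moved them away from their initial values.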
