Spark MLlib LDA: how to infer the topic distribution of a new, unseen document?

Question

I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to use the trained model to infer the topic distribution of a new, unseen document.

Recommended answer

As of Spark 1.5, this functionality has not been implemented for the DistributedLDAModel. What you need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training) documents, something like this:

// (document id, term-count vector) pairs, indexed with the training vocabulary
val newDocuments: RDD[(Long, Vector)] = ...
val topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)
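To make the step above concrete, here is a minimal end-to-end sketch. It is only an illustration under stated assumptions: the six-word vocabulary, the toy documents, the toCountVector helper and the object name are all hypothetical and not part of the original answer. The point it tries to show is that new documents have to be vectorized with the same word-to-index mapping that was used for training, and that words outside that vocabulary can only be dropped.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object InferUnseenDocument {
  // Hypothetical fixed vocabulary; the index of each word must be the same
  // at training time and at inference time.
  val vocabulary: Map[String, Int] =
    Seq("spark", "rdd", "cluster", "topic", "model", "word").zipWithIndex.toMap

  // Hypothetical helper: tokens -> sparse term-count vector.
  // Tokens that are not in the vocabulary are silently dropped.
  def toCountVector(tokens: Seq[String]): Vector = {
    val counts = tokens
      .flatMap(vocabulary.get)
      .groupBy(identity)
      .map { case (index, hits) => (index, hits.size.toDouble) }
      .toSeq
    Vectors.sparse(vocabulary.size, counts)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lda-infer").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Toy training corpus as (document id, term-count vector) pairs.
      val training: RDD[(Long, Vector)] = sc.parallelize(Seq(
        Seq("spark", "rdd", "cluster", "spark"),
        Seq("topic", "model", "word", "topic"),
        Seq("rdd", "cluster", "spark"),
        Seq("word", "topic", "model")
      )).zipWithIndex.map { case (tokens, id) => (id, toCountVector(tokens)) }

      // The default optimizer is EM, which produces a DistributedLDAModel.
      val distLDA = new LDA()
        .setK(2)
        .setMaxIterations(20)
        .run(training)
        .asInstanceOf[DistributedLDAModel]

      // An unseen document; "unknownword" is outside the vocabulary.
      val newDocuments: RDD[(Long, Vector)] = sc.parallelize(Seq(
        (100L, toCountVector(Seq("spark", "cluster", "unknownword")))
      ))

      // Convert to a local model, then infer per-document topic mixtures.
      distLDA.toLocal.topicDistributions(newDocuments).collect().foreach {
        case (id, dist) => println(s"doc $id -> ${dist.toArray.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Each returned vector has one entry per topic. With a corpus this small the numbers themselves are not meaningful; the sketch is only meant to show how the pieces fit together.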

Inferring through the converted local model is going to be less accurate than the EM algorithm that this paper suggests, but it will work. Alternatively, you could just use the new online variational EM training algorithm (selected with setOptimizer("online")), which already results in a LocalLDAModel. In addition to being faster, this new algorithm is also preferable because, unlike the older EM algorithm for fitting DistributedLDAModels, it can optimize the parameters (alphas) of the Dirichlet prior over the topic mixing weights for the documents. According to Wallach et al., optimization of the alphas is pretty important for obtaining good topics.
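A rough sketch of the online alternative follows. The object name, the function name and the parameter values are hypothetical. One assumption needs checking against your Spark version: the setter that switches on alpha optimization is called setOptimizeDocConcentration in more recent MLlib releases, but it had a different name in the 1.5 line, and the option is off by default.

```scala
import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object OnlineLdaSketch {
  // `corpus` and `newDocuments` are (document id, term-count vector) pairs
  // built over the same vocabulary.
  def trainAndInfer(corpus: RDD[(Long, Vector)],
                    newDocuments: RDD[(Long, Vector)],
                    numTopics: Int): RDD[(Long, Vector)] = {
    // Assumption: this setter name applies to more recent MLlib releases;
    // alpha optimization is disabled unless it is switched on explicitly.
    val optimizer = new OnlineLDAOptimizer()
      .setMiniBatchFraction(0.1)
      .setOptimizeDocConcentration(true)

    // With the online optimizer, run() yields a LocalLDAModel, so no
    // toLocal conversion is needed before inference.
    val model = new LDA()
      .setK(numTopics)
      .setMaxIterations(50)
      .setOptimizer(optimizer)
      .run(corpus)
      .asInstanceOf[LocalLDAModel]

    model.topicDistributions(newDocuments)
  }
}
```

The mini-batch fraction and iteration count are placeholders. On a very small corpus a larger fraction is probably needed so that each mini-batch contains some documents.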
