Extract document-topic matrix from Pyspark LDA Model


Question

I have successfully trained an LDA model in spark, via the Python API:

from pyspark.mllib.clustering import LDA

model = LDA.train(corpus, k=10)

This works completely fine, but I now need the document-topic matrix for the LDA model; as far as I can tell, all I can get is the word-topic matrix, using model.topicsMatrix().

Is there some way to get the document-topic matrix from the LDA model, and if not, is there an alternative method (other than implementing LDA from scratch) in Spark to run an LDA model that will give me the result I need?

After digging around a bit, I found the documentation for DistributedLDAModel in the Java API, which has a topicDistributions() method that I think is just what I need here (but I'm not 100% sure that the LDAModel in Pyspark is in fact a DistributedLDAModel under the hood...).

In any case, I am able to indirectly call this method like so, without any overt failures:

In [127]: model.call('topicDistributions')
Out[127]: MapPartitionsRDD[3156] at mapPartitions at PythonMLLibAPI.scala:1480

But if I actually look at the results, all I get are strings telling me that the result is actually a Scala tuple (I think):

In [128]: model.call('topicDistributions').take(5)
Out[128]:
[{u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'}]

Maybe this is generally the right approach, but is there a way to get the actual results?

Answer

After extensive research, this is definitely not possible via the Python API on the current version of Spark (1.5.1). But in Scala it's fairly straightforward (given an RDD of documents on which to train):

import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}

// first generate RDD of documents...

val numTopics = 10
val lda = new LDA().setK(numTopics).setMaxIterations(10)
val ldaModel = lda.run(documents)

// then convert to the distributed LDA model
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]
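
The documents RDD assumed above can be sketched as follows — a toy corpus where each element pairs a document ID with a term-count vector over a three-term vocabulary (sc is the usual SparkContext; the IDs and counts here are made up for illustration):

import org.apache.spark.mllib.linalg.Vectors

// toy corpus: (document ID, term-count vector)
val documents = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0)),
  (1L, Vectors.dense(0.0, 1.0, 3.0))
))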

Then getting the document-topic distributions is as simple as:

distLDAModel.topicDistributions
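
topicDistributions is an RDD[(Long, Vector)] pairing each document ID with its topic-probability vector, so the results can be inspected directly — a minimal sketch:

// print the topic distribution for the first few documents
distLDAModel.topicDistributions.take(5).foreach { case (docId, dist) =>
  println(s"doc $docId: ${dist.toArray.mkString(", ")}")
}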
