Extract document-topic matrix from Pyspark LDA Model
Question
I have successfully trained an LDA model in spark, via the Python API:
from pyspark.mllib.clustering import LDA
model = LDA.train(corpus, k=10)
This works completely fine, but I now need the document-topic matrix for the LDA model. As far as I can tell, all I can get is the word-topic matrix, using model.topicsMatrix().
Is there some way to get the document-topic matrix from the LDA model, and if not, is there an alternative method (other than implementing LDA from scratch) in Spark to run an LDA model that will give me the result I need?
After digging around a bit, I found the documentation for DistributedLDAModel in the Java API, which has a topicDistributions() method that I think is just what I need here (but I'm not 100% sure that the LDAModel in Pyspark is in fact a DistributedLDAModel under the hood...).
In any case, I am able to indirectly call this method like so, without any overt failures:
In [127]: model.call('topicDistributions')
Out[127]: MapPartitionsRDD[3156] at mapPartitions at PythonMLLibAPI.scala:1480
But if I actually look at the results, all I get are strings telling me that the result is actually a Scala tuple (I think):
In [128]: model.call('topicDistributions').take(5)
Out[128]:
[{u'__class__': u'scala.Tuple2'},
{u'__class__': u'scala.Tuple2'},
{u'__class__': u'scala.Tuple2'},
{u'__class__': u'scala.Tuple2'},
{u'__class__': u'scala.Tuple2'}]
Maybe this is generally the right approach, but is there a way to get the actual results?
Answer
After extensive research, this is definitely not possible via the Python API on the current version of Spark (1.5.1). But in Scala, it's fairly straightforward (given an RDD documents on which to train):
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
// first generate RDD of documents...
val numTopics = 10
val lda = new LDA().setK(numTopics).setMaxIterations(10)
val ldaModel = lda.run(documents)
// then convert to a distributed LDA model
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]
Then getting the document topic distributions is as simple as:
distLDAModel.topicDistributions