The accuracy of LDA predict for new documents with Spark


Problem description


I'm working with Spark MLlib, and I'm currently doing some work with LDA.

But when I use the code provided by Spark (see below) to predict the topic distribution of a document that was used to train the model, the predicted document-topics result is completely at odds with the document-topics obtained during training.

I don't know what causes this result.

I'm asking for help; here is my code:

Training: lda.run(corpus), where corpus is an RDD[(Long, Vector)] and each Vector holds the word counts of one document, indexed by vocabulary position (a sketch of how such a corpus can be built is shown right after the predict code below). Prediction:

    def predict(documents: RDD[(Long, Vector)], ldaModel: LDAModel): Array[(Long, Vector)] = {
      var docTopicsWeight = new Array[(Long, Vector)](documents.collect().length)
      ldaModel match {
        case localModel: LocalLDAModel =>
          docTopicsWeight = localModel.topicDistributions(documents).collect()
        case distModel: DistributedLDAModel =>
          docTopicsWeight = distModel.toLocal.topicDistributions(documents).collect()
      }
      docTopicsWeight
    }
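
For context, here is a minimal sketch (not the original poster's code) of how such a corpus can be built and the model trained; the vocabulary size, document vectors, and LDA parameters are made up for illustration, and sc is assumed to be an existing SparkContext:

    import org.apache.spark.mllib.clustering.{LDA, LDAModel}
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // Each document is a term-count vector over a fixed vocabulary (here of size 5);
    // the Long key is just a unique document ID.
    val docTermCounts: Seq[Vector] = Seq(
      Vectors.sparse(5, Seq((0, 2.0), (3, 1.0))),  // doc 0: word 0 twice, word 3 once
      Vectors.sparse(5, Seq((1, 1.0), (4, 3.0)))   // doc 1: word 1 once, word 4 three times
    )
    val corpus: RDD[(Long, Vector)] =
      sc.parallelize(docTermCounts).zipWithIndex.map { case (v, id) => (id, v) }
    corpus.cache()

    // Train the model; with the default EM optimizer this returns a DistributedLDAModel.
    val ldaModel: LDAModel = new LDA().setK(3).setMaxIterations(50).run(corpus)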

Solution

I'm not sure whether your question is actually about why you were getting errors in your code, but from what I understand it seems, first, that you were using the default (Scala) Vector class instead of org.apache.spark.mllib.linalg.Vector. Secondly, you can't use case classes on the model directly; you'll need the isInstanceOf and asInstanceOf methods for that.

    def predict(documents: RDD[(Long, org.apache.spark.mllib.linalg.Vector)], ldaModel: LDAModel): Array[(Long, org.apache.spark.mllib.linalg.Vector)] = {
      var docTopicsWeight = new Array[(Long, org.apache.spark.mllib.linalg.Vector)](documents.collect().length)
      if (ldaModel.isInstanceOf[LocalLDAModel]) {
        docTopicsWeight = ldaModel.asInstanceOf[LocalLDAModel].topicDistributions(documents).collect
      } else if (ldaModel.isInstanceOf[DistributedLDAModel]) {
        docTopicsWeight = ldaModel.asInstanceOf[DistributedLDAModel].toLocal.topicDistributions(documents).collect
      }
      docTopicsWeight
    }
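
As a hedged usage sketch, this predict could be called on new documents as shown below; newDocs and its 5-word vocabulary are invented for illustration, and ldaModel is assumed to come from a training step like the one sketched earlier. The important point, in line with the answer, is that the vectors are explicitly org.apache.spark.mllib.linalg.Vector and are built against the same vocabulary (same size and word-to-index mapping) used for training:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // New documents vectorized with the SAME vocabulary as the training corpus.
    val newDocs: RDD[(Long, Vector)] = sc.parallelize(Seq(
      (0L, Vectors.sparse(5, Seq((0, 1.0), (4, 2.0))))  // hypothetical unseen document
    ))

    val topicMix: Array[(Long, Vector)] = predict(newDocs, ldaModel)
    topicMix.foreach { case (docId, dist) =>
      // dist holds one weight per topic; the weights sum to (approximately) 1.
      println(s"doc $docId -> ${dist.toArray.mkString(", ")}")
    }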
