Predict Class Probabilities in Spark RandomForestClassifier


Question

I built random forest models using ml.classification.RandomForestClassifier. I am trying to extract the predicted probabilities from the models, but I only see the predicted classes, not the probabilities. According to this issue link, the issue is resolved, and it leads to this github pull request and this. However, it seems to be resolved only in version 1.5. I'm using AWS EMR, which provides Spark 1.4.1, and I still have no idea how to get the predicted probabilities. If anyone knows how to do it, please share your thoughts or solutions. Thanks!

Recommended answer

I have already answered a similar question before.

Unfortunately, with MLlib you can't get the per-instance probabilities for classification models in versions up to and including 1.4.1.

There are JIRA issues (SPARK-4362 and SPARK-6885) concerning this exact topic, which are IN PROGRESS as I write this answer. Nevertheless, the issue seems to have been on hold since November 2014.

There is currently no way to get the posterior probability of a prediction with the Naive Bayes model during prediction. This should be made available along with the label.

And here is a note from @sean-owen on the mailing list on a similar topic regarding the Naive Bayes classification algorithm:

This was recently discussed on this mailing list. You can't get the probabilities out directly now, but you can hack a bit to get the internal data structures of NaiveBayesModel and compute it from there.
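The hack mentioned in that note can be sketched roughly as follows. This is a minimal illustration, not code from the mailing list: it assumes a multinomial Naive Bayes model whose internals are available as a vector of log class priors and a matrix of log conditional probabilities (MLlib's NaiveBayesModel exposes these as `pi` and `theta`). The function name `naive_bayes_posteriors` and its parameter names are hypothetical.

```python
import math


def naive_bayes_posteriors(log_priors, log_likelihoods, features):
    """Posterior probability of each class for one feature vector.

    log_priors      -- log class priors, one per class (assumed: model.pi)
    log_likelihoods -- per-class rows of log conditional probabilities,
                       one entry per feature (assumed: model.theta)
    features        -- feature values (e.g. term counts) for one instance
    """
    # Unnormalised log posterior per class: log prior + theta_row . x
    scores = [
        log_prior + sum(w * x for w, x in zip(row, features))
        for log_prior, row in zip(log_priors, log_likelihoods)
    ]
    # Normalise with the log-sum-exp trick to avoid floating-point underflow.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Entry `i` of the result would correspond to the `i`-th class label of the model (`labels[i]` in MLlib), and the predicted class is the one with the largest entry.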

Reference: source.

This issue has been resolved in Spark 1.5.0. Please refer to the JIRA issue for more details.
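For readers already on Spark 1.5.0 or later, reading the probabilities might look roughly like the sketch below. It assumes the DataFrame-based `pyspark.ml` API, a training DataFrame with `features` and `label` columns, and that the fitted model adds a `probability` column when transforming data. The helper names `fit_forest` and `class_probabilities` are hypothetical, and older releases may additionally require the label column to carry class metadata (for example via StringIndexer).

```python
def fit_forest(train_df, num_trees=20):
    """Fit a random forest on a DataFrame with `features` and `label` columns."""
    # Imported lazily so this sketch can be loaded without a Spark installation.
    from pyspark.ml.classification import RandomForestClassifier

    rf = RandomForestClassifier(
        featuresCol="features", labelCol="label", numTrees=num_trees
    )
    return rf.fit(train_df)


def class_probabilities(model, test_df):
    """Return (prediction, [p_class0, p_class1, ...]) for each row of test_df."""
    # transform() is assumed to add `rawPrediction`, `probability`
    # and `prediction` columns to the input DataFrame.
    scored = model.transform(test_df)
    rows = scored.select("prediction", "probability").collect()
    return [
        (row.prediction, [float(p) for p in row.probability])
        for row in rows
    ]
```

Note that `collect()` pulls every row to the driver, so for a large test set it is probably better to keep working with the `probability` column inside the DataFrame.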

Concerning AWS, there is not much you can do about that for now. One solution might be to fork the emr-bootstrap-actions for Spark and configure it for your needs; you would then be able to install Spark on AWS using the bootstrap step.

Nevertheless, this might seem a little complicated.

There are some things you might need to consider:

  • Update the spark/config.file to install your spark-1.5. Something like:

+3  1.5.0   python  s3://support.elasticmapreduce/spark/install-spark-script.py s3://path.to.your.bucket.spark.installation/spark/1.5.0/spark-1.5.0.tgz

  • For the time being, the file listed above must be a proper build of Spark located in a specified S3 bucket that you own.

    To build your Spark, I advise reading the section about building-spark-for-emr in the examples, as well as the official documentation. That should be about it! (I hope I haven't forgotten anything.)

    EDIT: Amazon EMR release 4.1.0 offers an upgraded version of Apache Spark (1.5.0). You can check here for more details.
