How to get classification probabilities from PySpark MultilayerPerceptronClassifier?


Question

I'm using Spark 2.0.1 in python, my dataset is in DataFrame, so I'm using the ML (not MLLib) library for machine learning. I have a multilayer perceptron classifier and I have only two labels.

My question is, is it possible to get not only the labels, but also (or only) the probability for that label? Like not just 0 or 1 for every input, but something like 0.95 for 0 and 0.05 for 1. If this is not possible with MLP, but is possible with other classifier, I can change the classifier. I have only used MLP because I know they should be capable of returning the probability, but I can't find it in PySpark.

I have found a similar topic about this, How to get classification probabilities from MultilayerPerceptronClassifier? but they use Java and the solution they suggested doesn't work in python.

Thanks

Answer

Indeed, as of version 2.0, MLP in Spark ML does not seem to provide classification probabilities; nevertheless, a number of other classifiers do, e.g. Logistic Regression, Naive Bayes, Decision Tree, and Random Forest. Here is a short example with the first and the last one:

from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
df = sqlContext.createDataFrame([
     (0.0, Vectors.dense(0.0, 1.0)),
     (1.0, Vectors.dense(1.0, 0.0))], 
     ["label", "features"])
df.show()
# +-----+---------+
# |label| features|
# +-----+---------+
# |  0.0|[0.0,1.0]|
# |  1.0|[1.0,0.0]|
# +-----+---------+

lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)

rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="label", seed=42)
rf_model = rf.fit(df)

# test data:
test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
                       Row(features=Vectors.dense(0.5, 0.2))]).toDF()

lr_result = lr_model.transform(test)
lr_result.show()
# +---------+--------------------+--------------------+----------+
# | features|       rawPrediction|         probability|prediction|
# +---------+--------------------+--------------------+----------+
# |[0.2,0.5]|[0.98941878916476...|[0.72897310704261...|       0.0|
# |[0.5,0.2]|[-0.9894187891647...|[0.27102689295738...|       1.0|  
# +---------+--------------------+--------------------+----------+
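As a sanity check, for binary logistic regression the `probability` column is just the logistic sigmoid of the corresponding `rawPrediction` entry, so the numbers above can be reproduced in plain Python without Spark:

```python
import math

def sigmoid(x):
    # logistic function: maps a raw prediction margin to a probability
    return 1.0 / (1.0 + math.exp(-x))

# rawPrediction[0] for features [0.2, 0.5] in the output above
raw = 0.98941878916476
print(sigmoid(raw))  # ~0.72897..., matching probability[0] above
```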

rf_result = rf_model.transform(test)
rf_result.show()
# +---------+-------------+--------------------+----------+ 
# | features|rawPrediction|         probability|prediction| 
# +---------+-------------+--------------------+----------+ 
# |[0.2,0.5]|    [1.0,2.0]|[0.33333333333333...|       1.0| 
# |[0.5,0.2]|    [1.0,2.0]|[0.33333333333333...|       1.0| 
# +---------+-------------+--------------------+----------+

For MLlib, see my answer here; for several undocumented & counter-intuitive features of PySpark classification, see my relevant blog post.
