PySpark: How to get classification probabilities from MultilayerPerceptronClassifier?


Question

I'm using Spark 2.0.1 in Python. My dataset is in a DataFrame, so I'm using the ML (not MLlib) library for machine learning. I have a multilayer perceptron classifier and only two labels.

My question is: is it possible to get not only the labels, but also (or only) the probability for each label? That is, not just 0 or 1 for every input, but something like 0.95 for label 0 and 0.05 for label 1. If this is not possible with MLP but is possible with another classifier, I can change the classifier. I have only used MLP because I know it should be capable of returning probabilities, but I can't find how to do it in PySpark.

I have found a similar topic, "How to get classification probabilities from MultilayerPerceptronClassifier?", but it uses Java, and the solution suggested there doesn't work in Python.

Thx

Recommended answer

Indeed, as of version 2.0, MLP in Spark ML does not seem to provide classification probabilities. Nevertheless, a number of other classifiers do, e.g. Logistic Regression, Naive Bayes, Decision Tree, and Random Forest. Here is a short example with the first and the last of these:

from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
# sqlContext and sc are the objects predefined in the pyspark shell
df = sqlContext.createDataFrame([
     (0.0, Vectors.dense(0.0, 1.0)),
     (1.0, Vectors.dense(1.0, 0.0))], 
     ["label", "features"])
df.show()
# +-----+---------+ 
# |label| features| 
# +-----+---------+ 
# | 0.0 |[0.0,1.0]| 
# | 1.0 |[1.0,0.0]| 
# +-----+---------+

lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)

rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="label", seed=42)
rf_model = rf.fit(df)

# test data:
test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
                       Row(features=Vectors.dense(0.5, 0.2))]).toDF()

lr_result = lr_model.transform(test)
lr_result.show()
# +---------+--------------------+--------------------+----------+
# | features|       rawPrediction|         probability|prediction|
# +---------+--------------------+--------------------+----------+
# |[0.2,0.5]|[0.98941878916476...|[0.72897310704261...|       0.0|
# |[0.5,0.2]|[-0.9894187891647...|[0.27102689295738...|       1.0|  
# +---------+--------------------+--------------------+----------+

rf_result = rf_model.transform(test)
rf_result.show()
# +---------+-------------+--------------------+----------+ 
# | features|rawPrediction|         probability|prediction| 
# +---------+-------------+--------------------+----------+ 
# |[0.2,0.5]|    [1.0,2.0]|[0.33333333333333...|       1.0| 
# |[0.5,0.2]|    [1.0,2.0]|[0.33333333333333...|       1.0| 
# +---------+-------------+--------------------+----------+
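As a rough check on how the probability column relates to rawPrediction in the output above, here is a minimal plain-Python sketch (no Spark needed). It assumes that binary logistic regression applies a sigmoid to each raw score and that the random forest normalizes its raw scores to sum to 1, which is consistent with the numbers shown. The helper names `sigmoid` and `normalize` are made up for illustration, and the second logistic regression raw score is truncated in the output, so it is assumed here to be the negative of the first.

```python
import math

def sigmoid(x):
    """Logistic function, mapping a raw score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def normalize(scores):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

# First test row, logistic regression. The first raw score is copied from
# the output above; the second is an assumption (negative of the first).
lr_raw = [0.98941878916476, -0.98941878916476]
lr_prob = [sigmoid(r) for r in lr_raw]

# First test row, random forest: raw scores copied from the output above.
rf_raw = [1.0, 2.0]
rf_prob = normalize(rf_raw)

print(lr_prob)  # first entry should be close to 0.728973...
print(rf_prob)  # first entry should be close to 0.333333...
```

The first entry of each list is the probability of label 0 and the second that of label 1, so the second entry is what you would read off if you only want the probability of the positive class.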

For MLlib, see my answer here; for several undocumented and counter-intuitive features of PySpark classification, see my relevant blog post.
