How to get classification probabilities from PySpark MultilayerPerceptronClassifier?
Question
I'm using Spark 2.0.1 in Python. My dataset is in a DataFrame, so I'm using the ML (not MLlib) library for machine learning. I have a multilayer perceptron classifier and only two labels.
Is it possible to get not only the labels but also (or only) the probability for each label? That is, not just 0 or 1 for every input, but something like 0.95 for label 0 and 0.05 for label 1. If this is not possible with MLP but is possible with another classifier, I can change the classifier. I only used MLP because I know it should be capable of returning probabilities, but I can't find how to get them in PySpark.
I found a similar question, "How to get classification probabilities from MultilayerPerceptronClassifier?", but it uses Java, and the solution suggested there doesn't work in Python.
Thx
Recommended answer
Indeed, as of version 2.0, MLP in Spark ML does not seem to provide classification probabilities. Nevertheless, a number of other classifiers do, namely Logistic Regression, Naive Bayes, Decision Tree, and Random Forest. Here is a short example with the first and the last of these:
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
df = sqlContext.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.0)),
    (1.0, Vectors.dense(1.0, 0.0))],
    ["label", "features"])
df.show()
# +-----+---------+
# |label| features|
# +-----+---------+
# |  0.0|[0.0,1.0]|
# |  1.0|[1.0,0.0]|
# +-----+---------+
lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)
rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="label", seed=42)
rf_model = rf.fit(df)
# test data:
test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
                       Row(features=Vectors.dense(0.5, 0.2))]).toDF()
lr_result = lr_model.transform(test)
lr_result.show()
# +---------+--------------------+--------------------+----------+
# | features|       rawPrediction|         probability|prediction|
# +---------+--------------------+--------------------+----------+
# |[0.2,0.5]|[0.98941878916476...|[0.72897310704261...|       0.0|
# |[0.5,0.2]|[-0.9894187891647...|[0.27102689295738...|       1.0|
# +---------+--------------------+--------------------+----------+
rf_result = rf_model.transform(test)
rf_result.show()
# +---------+-------------+--------------------+----------+
# | features|rawPrediction|         probability|prediction|
# +---------+-------------+--------------------+----------+
# |[0.2,0.5]|    [1.0,2.0]|[0.33333333333333...|       1.0|
# |[0.5,0.2]|    [1.0,2.0]|[0.33333333333333...|       1.0|
# +---------+-------------+--------------------+----------+
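To make the relationship between the rawPrediction and probability columns above more concrete, here is a minimal sketch in plain Python (no Spark required). It assumes, as the printed values suggest, that for binary logistic regression the probability is the logistic function of the raw margin, and that for the random forest the probability is the raw per-class vote vector normalized to sum to 1. The helper names sigmoid and normalize are illustrative and not part of the PySpark API; the input numbers are copied from the (truncated) output shown above.

```python
import math


def sigmoid(margin):
    """Logistic function: maps a real-valued margin to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def normalize(votes):
    """Scale a list of non-negative per-class votes so that it sums to 1."""
    total = sum(votes)
    return [v / total for v in votes]


# Logistic regression, first test row: rawPrediction as printed above
# (truncated by show(), so the result matches only approximately).
lr_raw = [0.98941878916476, -0.98941878916476]
p0 = sigmoid(lr_raw[0])
lr_prob = [p0, 1.0 - p0]  # probabilities for label 0 and label 1

# Random forest, first test row: rawPrediction as printed above.
rf_raw = [1.0, 2.0]
rf_prob = normalize(rf_raw)  # probabilities for label 0 and label 1

print(lr_prob)
print(rf_prob)
```

Note that show() truncates the probability column, so only the first element of each vector is visible in the tables. In the random forest output the displayed 0.333... is the probability of label 0, and the remaining mass goes to label 1, which is why the prediction is 1.0.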
For MLlib, see my answer here; for several undocumented and counter-intuitive features of PySpark classification, see my relevant blog post.