pyspark 2.2.0 concept behind raw predictions field of logistic regression model
Question
I was trying to understand the concept of the output generated from logistic regression model in Pyspark.
Could anyone please explain the concept behind the rawPrediction field calculation generated from a logistic regression model?
Thanks.
Answer
In older versions of the Spark javadocs (e.g. 1.5.x), there used to be the following explanation:
The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).
It is not there in the later versions, but you can still find it in the Scala source code.
Anyway, and any unfortunate wording aside, the rawPrediction in Spark ML, for the logistic regression case, is what the rest of the world calls logits, i.e. the raw output of a logistic regression classifier, which is subsequently transformed into a probability score using the logistic function exp(x)/(1+exp(x)).
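To see why applying the logistic function elementwise to rawPrediction yields a valid probability vector: for a binary problem, Spark's rawPrediction holds the margin and its negation, and sigmoid(m) + sigmoid(-m) = 1 for any m. A minimal numpy sketch, independent of Spark (the margin value is copied from the example output below):

```python
import numpy as np

def sigmoid(z):
    # logistic function: maps a raw logit to the interval (0, 1)
    return np.exp(z) / (1.0 + np.exp(z))

# For binary logistic regression, Spark's rawPrediction is of the form
# [m, -m], so elementwise sigmoid gives two complementary probabilities.
m = 0.9894187891647654  # margin taken from the first test row below
probs = sigmoid(np.array([m, -m]))
print(probs)        # approx [0.7290, 0.2710]
print(probs.sum())  # sums to 1
```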
Here is an example with toy data:
spark.version
# u'2.2.0'

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row

# sqlContext and sc are the SQLContext and SparkContext instances
# pre-defined in the PySpark shell
df = sqlContext.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.0)),
    (1.0, Vectors.dense(1.0, 0.0))],
    ["label", "features"])

df.show()
# +-----+---------+
# |label| features|
# +-----+---------+
# |  0.0|[0.0,1.0]|
# |  1.0|[1.0,0.0]|
# +-----+---------+

lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)

test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
                       Row(features=Vectors.dense(0.5, 0.2))]).toDF()
lr_result = lr_model.transform(test)
lr_result.show(truncate=False)
Here is the result:
+---------+----------------------------------------+----------------------------------------+----------+
|features | rawPrediction | probability |prediction|
+---------+----------------------------------------+----------------------------------------+----------+
|[0.2,0.5]|[0.9894187891647654,-0.9894187891647654]|[0.7289731070426124,0.27102689295738763]| 0.0 |
|[0.5,0.2]|[-0.9894187891647683,0.9894187891647683]|[0.2710268929573871,0.728973107042613] | 1.0 |
+---------+----------------------------------------+----------------------------------------+----------+
Let's now confirm that the logistic function of rawPrediction gives the probability column:
import numpy as np
x1 = np.array([0.9894187891647654,-0.9894187891647654])
np.exp(x1)/(1+np.exp(x1))
# array([ 0.72897311, 0.27102689])
x2 = np.array([-0.9894187891647683,0.9894187891647683])
np.exp(x2)/(1+np.exp(x2))
# array([ 0.27102689, 0.72897311])
i.e. it is indeed the case.
So, to summarize regarding all three (3) output columns:
- rawPrediction is the raw output of the logistic regression classifier (an array with length equal to the number of classes)
- probability is the result of applying the logistic function to rawPrediction (an array of length equal to that of rawPrediction)
- prediction is the argument at which the array probability takes its maximum value, and it gives the most probable label (a single number)
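The last point above means the prediction column can be reproduced from the probability column with a simple argmax; a sketch using the first test row from the output above:

```python
import numpy as np

# prediction is just the index of the largest entry in the probability
# array; for binary LR that index is the predicted label (0.0 or 1.0)
probability = np.array([0.7289731070426124, 0.27102689295738763])
prediction = float(np.argmax(probability))
print(prediction)  # 0.0, matching the prediction column above
```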