pyspark 2.2.0 concept behind raw predictions field of logistic regression model


Question

I was trying to understand the concept of the output generated from a logistic regression model in Pyspark.

Could anyone please explain the concept behind the rawPrediction field calculation generated from a logistic regression model? Thanks.

Answer

In older versions of the Spark javadocs (e.g. 1.5.x), there used to be the following explanation:

The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).

It is not there in the later versions, but you can still find it in the Scala source code.

Anyway, and any unfortunate wording aside, the rawPredictions column in Spark ML, for the logistic regression case, is what the rest of the world calls logits, i.e. the raw output of a logistic regression classifier, which is subsequently transformed into a probability score using the logistic function exp(x)/(1+exp(x)).
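Before turning to Spark itself, the logit-to-probability transformation mentioned above can be sketched in plain numpy (the helper name `logistic` below is just for illustration):

```python
import numpy as np

def logistic(x):
    """Logistic (sigmoid) function: exp(x) / (1 + exp(x))."""
    return np.exp(x) / (1.0 + np.exp(x))

# A raw score (logit) of 0 maps to probability 0.5; large positive
# logits approach 1, and large negative logits approach 0.
print(logistic(0.0))   # 0.5
print(logistic(2.0))   # ~0.88
print(logistic(-2.0))  # ~0.12
```

This is the same function the answer applies to the rawPrediction values further down.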

Here is an example with toy data:

spark.version
# u'2.2.0'

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
df = sqlContext.createDataFrame([
     (0.0, Vectors.dense(0.0, 1.0)),
     (1.0, Vectors.dense(1.0, 0.0))], 
     ["label", "features"])
df.show()
# +-----+---------+
# |label| features|
# +-----+---------+
# |  0.0|[0.0,1.0]|
# |  1.0|[1.0,0.0]|
# +-----+---------+

lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)

test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
                       Row(features=Vectors.dense(0.5, 0.2))]).toDF()
lr_result = lr_model.transform(test)
lr_result.show(truncate=False)

Here is the result:

+---------+----------------------------------------+----------------------------------------+----------+ 
|features |                          rawPrediction |                            probability |prediction|
+---------+----------------------------------------+----------------------------------------+----------+ 
|[0.2,0.5]|[0.9894187891647654,-0.9894187891647654]|[0.7289731070426124,0.27102689295738763]|      0.0 |
|[0.5,0.2]|[-0.9894187891647683,0.9894187891647683]|[0.2710268929573871,0.728973107042613]  |      1.0 | 
+---------+----------------------------------------+----------------------------------------+----------+

Let's now confirm that the logistic function of rawPrediction gives the probability column:

import numpy as np

x1 = np.array([0.9894187891647654,-0.9894187891647654])
np.exp(x1)/(1+np.exp(x1))
# array([ 0.72897311, 0.27102689])

x2 = np.array([-0.9894187891647683,0.9894187891647683])
np.exp(x2)/(1+np.exp(x2))
# array([ 0.27102689, 0.72897311])

i.e. this is indeed the case.

So, to summarize regarding all three (3) output columns:

  • rawPrediction is the raw output of the logistic regression classifier (array with length equal to the number of classes)
  • probability is the result of applying the logistic function to rawPrediction (array of length equal to that of rawPrediction)
  • prediction is the index at which the array probability takes its maximum value (its argmax), and it gives the most probable label (a single number)
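Putting the three columns together, the chain rawPrediction → probability → prediction can be verified for the first test row (the numbers below are copied from the toy output above):

```python
import numpy as np

# rawPrediction for the first test row, features [0.2, 0.5]
raw = np.array([0.9894187891647654, -0.9894187891647654])

# probability: elementwise logistic function of rawPrediction
prob = np.exp(raw) / (1 + np.exp(raw))

# prediction: index of the largest probability, i.e. the most probable label
pred = float(np.argmax(prob))

print(prob)  # approximately [0.7290, 0.2710]
print(pred)  # 0.0
```

This reproduces the probability and prediction columns shown in the transformed DataFrame.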

