What do the columns 'rawPrediction' and 'probability' of a DataFrame mean in Spark MLlib?

Question

After training a LogisticRegressionModel, I transformed the test data DF with it and got the prediction DF. When I call prediction.show(), the output column names are: [label | features | rawPrediction | probability | prediction]. I know what label and features mean, but how should I understand rawPrediction, probability, and prediction?

Recommended answer

Note: please also see the answer by desertnaut: https://stackoverflow.com/a/52947815/1056563

RawPrediction is typically the direct probability/confidence calculation. From Spark docs:

Raw prediction for each possible label. The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).

The prediction is the most likely label, i.e. the index of the largest entry of rawPrediction, found via argmax:

  protected def raw2prediction(rawPrediction: Vector): Double =
          rawPrediction.argmax
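
To make the argmax step concrete, here is a minimal sketch in plain Scala. It uses an Array[Double] as a stand-in for Spark's Vector so that it runs without Spark on the classpath; the object name and the sample scores are hypothetical and not part of Spark.

```scala
object Raw2PredictionSketch {
  // Index of the largest entry; in this sketch ties resolve to the first maximum.
  def argmax(raw: Array[Double]): Int = raw.indexOf(raw.max)

  // Same shape as raw2prediction above: the predicted label is the winning index as a Double.
  def raw2prediction(raw: Array[Double]): Double = argmax(raw).toDouble

  def main(args: Array[String]): Unit = {
    // Hypothetical raw scores for labels 0, 1 and 2.
    val rawPrediction = Array(-1.2, 0.4, 2.7)
    println(raw2prediction(rawPrediction))
  }
}
```

With these sample scores the third entry is the largest, so the predicted label is 2.0.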

Probability is the conditional probability of each class. From the scaladoc:

Estimate the probability of each class given the raw prediction,
doing the computation in-place. These predictions are also called class conditional probabilities.

The actual calculation depends on which Classifier you are using.

DecisionTree

Normalize a vector of raw predictions to be a multinomial probability vector, in place.

It simply sums by class across the instances and then divides by the total instance count.

 class_k probability = Count_k/Count_Total
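
The count normalization can be sketched in plain Scala as follows. This is an illustration of the formula only, not Spark's actual implementation: the raw prediction is modeled as an Array[Double] of per-class instance counts, and the object name and sample counts are hypothetical.

```scala
object NormalizeCountsSketch {
  // Divide each class count by the total count.
  // An all-zero vector is returned unchanged to avoid dividing by zero (a choice made for this sketch).
  def normalize(counts: Array[Double]): Array[Double] = {
    val total = counts.sum
    if (total == 0.0) counts.clone() else counts.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical leaf node holding 3, 1 and 4 training instances of classes 0, 1 and 2.
    val leafCounts = Array(3.0, 1.0, 4.0)
    println(normalize(leafCounts).mkString(", "))
  }
}
```

Here the raw counts (3, 1, 4) become the probabilities 3/8, 1/8 and 4/8.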

LogisticRegression

It uses the logistic formula

 class_k probability = 1/(1 + exp(-rawPrediction_k))
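
As a rough illustration of the logistic formula, the sketch below applies it element-wise to a two-class raw prediction. It assumes the binary case, with the raw prediction laid out as (-margin, margin); that layout, the object name, and the sample margin are assumptions made for the example. Under that layout the two resulting probabilities sum to 1.

```scala
object LogisticProbabilitySketch {
  // The logistic (sigmoid) function from the formula above.
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // Apply the logistic function to each raw score independently.
  def raw2probability(raw: Array[Double]): Array[Double] = raw.map(sigmoid)

  def main(args: Array[String]): Unit = {
    // Assumed binary layout: (-margin, margin) with a hypothetical margin of 2.0.
    val rawPrediction = Array(-2.0, 2.0)
    val probability = raw2probability(rawPrediction)
    println(probability.mkString(", "))
    println(probability.sum)
  }
}
```

The class with the larger raw score also receives the larger probability, so argmax over either column picks the same label.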

NaiveBayes

 class_k probability = exp(rawPrediction_k - max(rawPrediction)) / sum_j exp(rawPrediction_j - max(rawPrediction))
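
A minimal sketch of this step in plain Scala, assuming the raw predictions are log-scale scores that are shifted by their maximum, exponentiated, and then normalized so that they sum to 1. The shift keeps exp from underflowing to zero for very negative scores. The object name and sample values are hypothetical, and the code illustrates the idea rather than reproducing Spark's source.

```scala
object NaiveBayesProbabilitySketch {
  // Shift by the maximum log score, exponentiate, then normalize to sum to 1.
  def raw2probability(rawLog: Array[Double]): Array[Double] = {
    val maxLog = rawLog.max
    val shifted = rawLog.map(r => math.exp(r - maxLog))
    val total = shifted.sum
    shifted.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical log-scale scores; exp(-1050.0) alone would underflow to 0.0.
    val rawPrediction = Array(-1050.0, -1052.0, -1049.0)
    println(raw2probability(rawPrediction).mkString(", "))
  }
}
```

Without the shift every exponentiated score in this example would be 0.0 and the normalization would divide by zero; with it the largest score maps to exp(0) = 1 before normalization.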

RandomForest

 class_k probability = Count_k/Count_Total
