What do the columns 'rawPrediction' and 'probability' of a DataFrame mean in Spark MLlib?


Question

After training a LogisticRegressionModel, I transformed the test data DF with it and got the prediction DF. When I call prediction.show(), the output column names are: [label | features | rawPrediction | probability | prediction]. I know what label and features mean, but how should I understand rawPrediction, probability, and prediction?

Recommended answer

rawPrediction is typically the direct probability/confidence calculation. From the Spark docs:

Raw prediction for each possible label. The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).

The prediction is the label whose rawPrediction entry is largest, found via argmax:

  protected def raw2prediction(rawPrediction: Vector): Double =
          rawPrediction.argmax
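
To make the argmax step concrete, here is a minimal sketch in plain Scala with no Spark dependency. The object name ArgmaxSketch, the helper argmax, and the three-class scores are hypothetical and only illustrate picking the index of the largest raw score; this is not Spark's own code.

```scala
object ArgmaxSketch {
  // Return the index of the largest entry in the score array.
  def argmax(values: Array[Double]): Int = {
    require(values.nonEmpty, "need at least one class score")
    var best = 0
    for (i <- 1 until values.length) {
      if (values(i) > values(best)) best = i
    }
    best
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical rawPrediction for a three-class problem (made-up numbers).
    val rawPrediction = Array(-1.2, 0.4, 2.7)
    // The predicted label is the index of the largest score, as a Double.
    val prediction = argmax(rawPrediction).toDouble
    println(s"prediction = $prediction") // prediction = 2.0
  }
}
```

The label comes back as a Double to mirror the return type of raw2prediction in the snippet above.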

Probability is the conditional probability for each class. Here is the scaladoc:

Estimate the probability of each class given the raw prediction, doing the computation in-place. These predictions are also called class conditional probabilities.

The actual calculation depends on which Classifier you are using.

DecisionTree

Normalize a vector of raw predictions to be a multinomial probability vector, in place.

It simply sums the instances by class and then divides by the total instance count.

 class_k probability = Count_k/Count_Total
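
The count-based formula can be sketched in plain Scala as follows. The object name CountNormalizationSketch, the helper normalize, and the example counts are hypothetical; the sketch assumes the raw prediction is a vector of per-class instance counts.

```scala
object CountNormalizationSketch {
  // Divide each class count by the total so the entries sum to 1.
  def normalize(counts: Array[Double]): Array[Double] = {
    val total = counts.sum
    require(total > 0.0, "total instance count must be positive")
    counts.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical counts: 30 instances of class 0 and 10 of class 1.
    val probability = normalize(Array(30.0, 10.0))
    println(probability.mkString("[", ", ", "]")) // [0.75, 0.25]
  }
}
```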

LogisticRegression

It uses the logistic formula:

 class_k probability = 1/(1 + exp(-rawPrediction_k))
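
As a rough illustration of the logistic formula, the sketch below applies it element-wise to a two-class raw prediction. The names LogisticSketch and sigmoid are hypothetical, and the layout of the raw vector as (-margin, margin) is an assumption made for this example, not something stated in the answer.

```scala
object LogisticSketch {
  // Logistic (sigmoid) function: maps any real score into (0, 1).
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  def main(args: Array[String]): Unit = {
    // Hypothetical binary rawPrediction, ASSUMED to hold (-margin, margin).
    val rawPrediction = Array(-0.8, 0.8)
    val probability = rawPrediction.map(sigmoid)
    println(probability.mkString("[", ", ", "]"))
    // Under this assumption the two entries sum to 1,
    // because sigmoid(-m) + sigmoid(m) = 1 for any m.
    println(s"sum = ${probability.sum}")
  }
}
```

With more than two classes, applying the formula to each score independently would not generally yield entries that sum to 1, so treat this as a sketch of the binary case only.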

Naive Bayes

 class_k probability = exp(rawPrediction_k - max(rawPrediction)) / sum_j exp(rawPrediction_j - max(rawPrediction))
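
A minimal sketch of turning log-space scores into probabilities: shift every score by the maximum before exponentiating, then normalize so the entries sum to 1. The names LogScoreNormalizationSketch and toProbabilities and the example scores are hypothetical, and the sketch assumes rawPrediction holds unnormalized per-class log scores.

```scala
object LogScoreNormalizationSketch {
  // Convert log-space scores into a probability vector.
  def toProbabilities(rawPrediction: Array[Double]): Array[Double] = {
    require(rawPrediction.nonEmpty, "need at least one class score")
    // Subtracting the maximum keeps exp() from underflowing to 0 for every class.
    val maxLog = rawPrediction.foldLeft(Double.NegativeInfinity)((a, b) => math.max(a, b))
    val shifted = rawPrediction.map(r => math.exp(r - maxLog))
    // Normalize so the entries sum to 1.
    val total = shifted.sum
    shifted.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical log scores; exp(-1050.0) on its own would underflow to 0.0.
    val rawPrediction = Array(-1050.0, -1052.0, -1049.0)
    val probability = toProbabilities(rawPrediction)
    println(probability.mkString("[", ", ", "]"))
  }
}
```

The shift does not change the result, since the common factor exp(-max) cancels in the ratio; it only keeps the intermediate values representable.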

Random Forest

 class_k probability = Count_k/Count_Total
