What do the columns 'rawPrediction' and 'probability' of a DataFrame mean in Spark MLlib?
Question
After training a LogisticRegressionModel, I transformed the test DataFrame with it and got a prediction DataFrame. When I call prediction.show(), the output columns are: [label | features | rawPrediction | probability | prediction]. I know what label and features mean, but how should I understand rawPrediction, probability, and prediction?
Recommended answer
Note: please also see the answer by desertnaut: https://stackoverflow.com/a/52947815/1056563
rawPrediction is typically the direct confidence score computed by the model, before any conversion to a probability. From the Spark docs:
Raw prediction for each possible label. The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).
prediction is the class with the highest rawPrediction score, that is, the index of the largest entry, found via argmax:
protected def raw2prediction(rawPrediction: Vector): Double =
rawPrediction.argmax
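To make the argmax step concrete, here is a minimal sketch in plain Scala. It assumes the raw prediction is held in an ordinary Array[Double] rather than a Spark Vector, and the object name Raw2Prediction is hypothetical; it only illustrates the idea and is not Spark's implementation.

```scala
object Raw2Prediction {
  // Return the index of the largest raw score as a Double,
  // mirroring the shape of the prediction column.
  // Assumption: a plain Array[Double] stands in for Spark's Vector.
  def raw2prediction(rawPrediction: Array[Double]): Double =
    rawPrediction.indices.maxBy(i => rawPrediction(i)).toDouble
}
```

For example, Raw2Prediction.raw2prediction(Array(-1.2, 1.2)) selects index 1, so the predicted label is 1.0.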
probability is the conditional probability of each class. From the scaladoc:
Estimate the probability of each class given the raw prediction, doing the computation in-place. These predictions are also called class conditional probabilities.
The actual calculation depends on which Classifier you are using.
DecisionTree
Normalize a vector of raw predictions to be a multinomial probability vector, in place.
It simply sums by class across the instances and then divides by the total instance count.
class_k probability = Count_k/Count_Total
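The count-based normalization can be sketched as follows. This is an illustration in plain Scala, assuming the raw prediction is an array of per-class instance counts; the object name TreeProbability and the handling of an all-zero vector are assumptions made for this example.

```scala
object TreeProbability {
  // Divide each per-class count by the total count so the
  // resulting entries sum to 1.
  // Assumption: an all-zero input is returned unchanged to avoid
  // dividing by zero.
  def normalize(counts: Array[Double]): Array[Double] = {
    val total = counts.sum
    if (total == 0.0) counts.clone() else counts.map(_ / total)
  }
}
```

With counts of 3 instances for class 0 and 1 instance for class 1, TreeProbability.normalize(Array(3.0, 1.0)) gives probabilities 0.75 and 0.25.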
LogisticRegression
It uses the logistic formula (shown here for the binary case):
class_k probability = 1/(1 + exp(-rawPrediction_k))
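As a sketch of the logistic formula, the snippet below applies it to each entry of a raw prediction. It assumes a binary model whose raw prediction has the form (-margin, margin), which is why the two resulting probabilities sum to 1; the object name LogisticProbability is hypothetical and the code is plain Scala, not Spark's own.

```scala
object LogisticProbability {
  // Logistic (sigmoid) function: maps any real score into (0, 1).
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // Apply the logistic formula to every raw score.
  // Assumption: raw is (-margin, margin) from a binary model.
  def probabilities(raw: Array[Double]): Array[Double] = raw.map(sigmoid)
}
```

A raw score of 0 maps to a probability of 0.5, and larger raw scores map to probabilities closer to 1.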
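NaiveBayes
The raw predictions are log-scale scores. The maximum is subtracted before exponentiating for numerical stability, and the result is normalized so the entries sum to 1:
class_k probability = exp(rawPrediction_k - max(rawPrediction)) / sum_j exp(rawPrediction_j - max(rawPrediction))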
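A numerically stable way to turn log-scale raw scores into probabilities can be sketched as below, assuming the raw predictions are unnormalized log-probabilities. The idea is to subtract the maximum before exponentiating and then divide by the sum so the entries add up to 1. The object and method names are hypothetical, and this is an illustration of the technique rather than a copy of Spark's code.

```scala
object LogScoreProbability {
  // Convert unnormalized log-scores into probabilities.
  // Subtracting the maximum keeps exp() from underflowing to 0 for
  // every class when all scores are very negative.
  def normalizeLogScores(raw: Array[Double]): Array[Double] = {
    val maxLog = raw.max
    val shifted = raw.map(r => math.exp(r - maxLog))
    val total = shifted.sum
    shifted.map(_ / total)
  }
}
```

Without the shift, scores such as -1000 and -1001 would both exponentiate to 0 and the normalization would divide by zero; with it, the larger score gets the larger share of the probability mass.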
RandomForest
class_k probability = Count_k/Count_Total