What do the 'rawPrediction' and 'probability' columns of a DataFrame mean in Spark MLlib?
Question
After training a LogisticRegressionModel, I transformed the test data DF with it and got the prediction DF. When I call prediction.show(), the output column names are: [label | features | rawPrediction | probability | prediction]. I know what label and features mean, but how should I understand rawPrediction, probability, and prediction?
Recommended answer
rawPrediction is typically the direct probability/confidence calculation, before any conversion to a normalized probability. From the Spark docs:
Raw prediction for each possible label. The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).
prediction is the most likely class (the statistical mode), that is, the index of the largest entry of rawPrediction, found via argmax:
protected def raw2prediction(rawPrediction: Vector): Double =
rawPrediction.argmax
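To make the argmax step concrete, here is a minimal sketch in plain Python (not Spark's own Scala code). The function name raw_to_prediction and the sample scores are hypothetical and chosen only for illustration.

```python
def raw_to_prediction(raw_prediction):
    """Return the index of the largest raw score, as a float label."""
    # max() over the indices, keyed by the score at each index, gives the argmax.
    best_index = max(range(len(raw_prediction)), key=lambda i: raw_prediction[i])
    return float(best_index)


# Hypothetical raw scores for classes 0, 1 and 2.
raw = [-1.2, 0.4, 0.1]
prediction = raw_to_prediction(raw)
print(prediction)
```

The predicted label is simply the position of the highest score, which is why the prediction column holds values such as 0.0, 1.0, 2.0.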
probability is the conditional probability of each class. Here is the scaladoc:
Estimate the probability of each class given the raw prediction, doing the computation in-place. These predictions are also called class conditional probabilities.
The actual calculation depends on which Classifier you are using.
DecisionTree
Normalize a vector of raw predictions to be a multinomial probability vector, in place.
It simply sums by class across the instances and then divides by the total instance count.
class_k probability = Count_k/Count_Total
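As an illustration of the count-based normalization, the following plain-Python sketch assumes the raw prediction is a vector of per-class instance counts. The name normalize_counts, the sample counts, and the handling of an all-zero vector are assumptions made for this example, not Spark's API.

```python
def normalize_counts(class_counts):
    """Turn per-class counts into probabilities that sum to 1."""
    total = sum(class_counts)
    if total == 0:
        # Nothing to normalize; returning the input unchanged is a choice
        # made for this sketch only.
        return list(class_counts)
    return [count / total for count in class_counts]


# Hypothetical counts: 3 instances of class 0 and 1 instance of class 1.
leaf_counts = [3.0, 1.0]
probabilities = normalize_counts(leaf_counts)
print(probabilities)
```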
LogisticRegression
It uses the logistic formula:
class_k probability = 1/(1 + exp(-rawPrediction_k))
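The logistic formula can be sketched in plain Python as follows. This assumes a binary model whose two raw scores are a margin and its negation; that layout, the function name logistic, and the sample values are assumptions for illustration only.

```python
import math


def logistic(raw_score):
    """Map a raw score (a margin) to a value in (0, 1)."""
    # Not guarded against overflow for very large negative scores.
    return 1.0 / (1.0 + math.exp(-raw_score))


# Hypothetical raw scores for classes 0 and 1 (assumed to be -margin and +margin).
raw_prediction = [-0.8, 0.8]
probability = [logistic(score) for score in raw_prediction]
print(probability)
```

Under that assumption the two probabilities sum to 1, because logistic(-x) + logistic(x) = 1 for any x.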
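Naive Bayes
The raw predictions are log-scale scores. They are shifted by their maximum, exponentiated, and then normalized so that they sum to 1:
class_k probability = exp(rawPrediction_k - max(rawPrediction)) / sum_j exp(rawPrediction_j - max(rawPrediction))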
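A minimal plain-Python sketch of turning log-scale raw scores into probabilities is shown here. It assumes the scores are shifted by their maximum before exponentiation (for numerical stability) and then divided by their sum. The function name and the sample values are hypothetical.

```python
import math


def log_scores_to_probabilities(raw_prediction):
    """Convert log-scale scores into probabilities that sum to 1."""
    max_log = max(raw_prediction)
    # Subtracting the maximum keeps every exponent <= 0, which avoids overflow.
    shifted = [math.exp(score - max_log) for score in raw_prediction]
    total = sum(shifted)
    return [value / total for value in shifted]


# Hypothetical log-scale raw scores for three classes.
raw = [-12.0, -10.0, -15.0]
probs = log_scores_to_probabilities(raw)
print(probs)
```

The largest raw score always maps to exp(0) = 1 before normalization, so the sum is never zero and the division is safe.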
Random Forest
class_k probability = Count_k/Count_Total
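For an ensemble, one way to picture this is the sketch below, which assumes each tree contributes a per-class vector, the vectors are summed element-wise, and the sum is then normalized. That aggregation, the function name, and the sample votes are assumptions for illustration rather than a description of Spark's exact implementation.

```python
def forest_probabilities(per_tree_scores):
    """Sum per-class scores across trees, then normalize to probabilities."""
    num_classes = len(per_tree_scores[0])
    # Element-wise sum over all trees for each class.
    totals = [sum(tree[k] for tree in per_tree_scores) for k in range(num_classes)]
    grand_total = sum(totals)
    return [value / grand_total for value in totals]


# Hypothetical per-class scores from three trees for a two-class problem.
tree_votes = [[3.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
forest_probs = forest_probabilities(tree_votes)
print(forest_probs)
```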