How to interpret the probabilities (p0, p1) of the result of h2o.predict()
Question
I would like to understand the meaning of the result of the h2o.predict() function from the H2O R package. I noticed that in some cases, when the predict column is 1, the p1 column has a lower value than the p0 column. My interpretation of the p0 and p1 columns is that they hold the probability of each event, so I expected that when predict = 1 the probability p1 would be higher than the probability of the opposite event (p0), but this is not always the case, as the following example using the prostate dataset shows.
Here is a reproducible example:
library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1)
prostate.hex <- h2o.importFile("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
prostate.hex$CAPSULE <- as.factor(prostate.hex$CAPSULE)
prostate.hex$RACE <- as.factor(prostate.hex$RACE)
prostate.hex$DCAPS <- as.factor(prostate.hex$DCAPS)
prostate.hex$DPROS <- as.factor(prostate.hex$DPROS)
prostate.hex.split = h2o.splitFrame(data = prostate.hex,
ratios = c(0.70, 0.20, 0.10), seed = 1234)
train.hex <- prostate.hex.split[[1]]
validate.hex <- prostate.hex.split[[2]]
test.hex <- prostate.hex.split[[3]]
fit <- h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "DCAPS"),
training_frame = train.hex,
validation_frame = validate.hex,
family = "binomial", nfolds = 0, alpha = 0.5)
prostate.predict = h2o.predict(object = fit, newdata = test.hex)
result <- as.data.frame(prostate.predict)
subset(result, predict == 1 & p1 < 0.4)
I get the following output for the result of the subset function:
predict p0 p1
11 1 0.6355974 0.3644026
17 1 0.6153021 0.3846979
23 1 0.6289063 0.3710937
25 1 0.6007919 0.3992081
31 1 0.6239587 0.3760413
For all the above observations from the test.hex dataset the prediction is 1, but p0 > p1.
The total number of observations where predict = 1 but p1 < p0 is:
> nrow(subset(result, predict == 1 & p1 < p0))
[1] 14
Conversely, there are no observations where predict = 0 and p0 < p1:
> nrow(subset(result, predict == 0 & p0 < p1))
[1] 0
Here is the table of the predict column:
> table(result$predict)
0 1
18 23
We are using CAPSULE as the decision variable, with the following values:
> levels(as.data.frame(prostate.hex)$CAPSULE)
[1] "0" "1"
Any suggestions?
Note: The question with a similar topic, How to interpret results of h2o.predict, does not address this specific issue.
Accepted answer
It seems (also see here) that the threshold that maximizes the F1 score on the validation dataset is used as the default threshold for classification with h2o.glm(). We can observe the following:
- The threshold value that maximizes the F1 score on the validation dataset is 0.363477.
- All datapoints with predicted p1 probability less than this threshold are classified as class 0 (the datapoint predicted as class 0 with the highest p1 probability has p1 = 0.3602365 < 0.363477).
- All datapoints with predicted p1 probability greater than this threshold are classified as class 1 (the datapoint predicted as class 1 with the lowest p1 probability has p1 = 0.3644026 > 0.363477).
min(result[result$predict==1,]$p1)
# [1] 0.3644026
max(result[result$predict==0,]$p1)
# [1] 0.3602365
# Thresholds found by maximizing the metrics on the training dataset
fit@model$training_metrics@metrics$max_criteria_and_metric_scores
#Maximum Metrics: Maximum metrics at their respective thresholds
# metric threshold value idx
#1 max f1 0.314699 0.641975 200
#2 max f2 0.215203 0.795148 262
#3 max f0point5 0.451965 0.669856 74
#4 max accuracy 0.451965 0.707581 74
#5 max precision 0.998285 1.000000 0
#6 max recall 0.215203 1.000000 262
#7 max specificity 0.998285 1.000000 0
#8 max absolute_mcc 0.451965 0.395147 74
#9 max min_per_class_accuracy 0.360174 0.652542 127
#10 max mean_per_class_accuracy 0.391279 0.683269 97
# Thresholds found by maximizing the metrics on the validation dataset
fit@model$validation_metrics@metrics$max_criteria_and_metric_scores
#Maximum Metrics: Maximum metrics at their respective thresholds
# metric threshold value idx
#1 max f1 0.363477 0.607143 33
#2 max f2 0.292342 0.785714 51
#3 max f0point5 0.643382 0.725806 9
#4 max accuracy 0.643382 0.774194 9
#5 max precision 0.985308 1.000000 0
#6 max recall 0.292342 1.000000 51
#7 max specificity 0.985308 1.000000 0
#8 max absolute_mcc 0.643382 0.499659 9
#9 max min_per_class_accuracy 0.379602 0.650000 28
#10 max mean_per_class_accuracy 0.618286 0.702273 11
result[order(result$predict),]
# predict p0 p1
#5 0 0.703274569 0.2967254
#6 0 0.639763460 0.3602365
#13 0 0.689557497 0.3104425
#14 0 0.656764541 0.3432355
#15 0 0.696248328 0.3037517
#16 0 0.707069611 0.2929304
#18 0 0.692137408 0.3078626
#19 0 0.701482762 0.2985172
#20 0 0.705973644 0.2940264
#21 0 0.701156961 0.2988430
#22 0 0.671778898 0.3282211
#24 0 0.646735016 0.3532650
#26 0 0.646582708 0.3534173
#27 0 0.690402957 0.3095970
#32 0 0.649945017 0.3500550
#37 0 0.804937468 0.1950625
#40 0 0.717706731 0.2822933
#41 0 0.642094040 0.3579060
#1 1 0.364577068 0.6354229
#2 1 0.503432724 0.4965673
#3 1 0.406771233 0.5932288
#4 1 0.551801718 0.4481983
#7 1 0.339600779 0.6603992
#8 1 0.002978593 0.9970214
#9 1 0.378034417 0.6219656
#10 1 0.596298925 0.4037011
#11 1 0.635597359 0.3644026
#12 1 0.552662241 0.4473378
#17 1 0.615302107 0.3846979
#23 1 0.628906297 0.3710937
#25 1 0.600791894 0.3992081
#28 1 0.216571552 0.7834284
#29 1 0.559174924 0.4408251
#30 1 0.489514642 0.5104854
#31 1 0.623958696 0.3760413
#33 1 0.504691497 0.4953085
#34 1 0.582509462 0.4174905
#35 1 0.504136056 0.4958639
#36 1 0.463076505 0.5369235
#38 1 0.510908093 0.4890919
#39 1 0.469376828 0.5306232
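The default labeling can be reproduced by hand. Here is a minimal sketch, assuming the fit model and the result data frame from the code above; h2o.performance() and h2o.find_threshold_by_max_metric() are functions of the h2o R package:

perf <- h2o.performance(fit, valid = TRUE)              # metrics on the validation frame
thr  <- h2o.find_threshold_by_max_metric(perf, "f1")    # max-F1 threshold, 0.363477 in this run
# Re-derive the class labels by comparing p1 against that threshold,
# which matches the predict column produced by h2o.predict():
manual <- as.integer(result$p1 > thr)
identical(manual, as.integer(as.character(result$predict)))

So predict is not argmax(p0, p1); it is the result of thresholding p1 at the validation max-F1 threshold, which here is below 0.5.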