How to interpret the probabilities (p0, p1) of the result of h2o.predict()


Problem Description

I would like to understand the meaning of the result of the h2o.predict() function from the H2O R package. I noticed that in some cases, when the predict column is 1, the p1 column has a lower value than the p0 column. My interpretation of the p0 and p1 columns is that they give the probability of each class, so I expected p1 to be higher than the probability of the opposite class (p0) whenever predict = 1. However, this is not always the case, as the following example with the prostate dataset shows.

Here is a reproducible example:

library(h2o)
h2o.init(max_mem_size = "12g", nthreads = -1)
prostate.hex <- h2o.importFile("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
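# Encode the response and the categorical predictors as factors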
prostate.hex$CAPSULE  <- as.factor(prostate.hex$CAPSULE)
prostate.hex$RACE     <- as.factor(prostate.hex$RACE)
prostate.hex$DCAPS    <- as.factor(prostate.hex$DCAPS)
prostate.hex$DPROS    <- as.factor(prostate.hex$DPROS)

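# Split into train/validation/test subsets (seed fixed for reproducibility)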
prostate.hex.split <- h2o.splitFrame(data = prostate.hex,
  ratios = c(0.70, 0.20, 0.10), seed = 1234)
train.hex     <- prostate.hex.split[[1]]
validate.hex  <- prostate.hex.split[[2]]
test.hex      <- prostate.hex.split[[3]]

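# Elastic-net binomial GLM (alpha = 0.5 mixes the L1 and L2 penalties)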
fit <- h2o.glm(y = "CAPSULE", x = c("AGE", "RACE", "PSA", "DCAPS"),
  training_frame = train.hex,
  validation_frame = validate.hex,
  family = "binomial", nfolds = 0, alpha = 0.5)

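# Score the test set; the output has predict, p0, and p1 columns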
prostate.predict <- h2o.predict(object = fit, newdata = test.hex)
result <- as.data.frame(prostate.predict)
subset(result, predict == 1 & p1 < 0.4)

I get the following output from the subset call:

   predict        p0        p1
11       1 0.6355974 0.3644026
17       1 0.6153021 0.3846979
23       1 0.6289063 0.3710937
25       1 0.6007919 0.3992081
31       1 0.6239587 0.3760413

For all of the above observations from the test.hex dataset, the prediction is 1 but p0 > p1.

The total number of observations where predict = 1 but p1 < p0 is:

>   nrow(subset(result, predict == 1 & p1 < p0))
[1] 14

Conversely, there are no observations where predict = 0 but p0 < p1:

>   nrow(subset(result, predict == 0 & p0 < p1))
[1] 0

Here is the table of predict counts:

> table(result$predict)

 0  1 
18 23 

We are using CAPSULE as the decision variable, with the following values:

> levels(as.data.frame(prostate.hex)$CAPSULE)
[1] "0" "1"

Any suggestions?

Note: The question with a similar topic, How to interpret results of h2o.predict, does not address this specific issue.

Recommended Answer

It seems (see also here) that the threshold that maximizes the F1 score on the validation dataset is used as the default classification threshold by h2o.glm(). We can observe the following (a short sketch after the output below re-derives this by hand):

  1. The threshold that maximizes the F1 score on the validation dataset is 0.363477.
  2. All datapoints with a predicted p1 probability below this threshold are classified as class 0 (the datapoint predicted as class 0 with the highest p1 has p1 = 0.3602365 < 0.363477).
  3. All datapoints with a predicted p1 probability above this threshold are classified as class 1 (the datapoint predicted as class 1 with the lowest p1 has p1 = 0.3644026 > 0.363477).

min(result[result$predict==1,]$p1)
# [1] 0.3644026
max(result[result$predict==0,]$p1)
# [1] 0.3602365

# Thresholds found by maximizing the metrics on the training dataset
fit@model$training_metrics@metrics$max_criteria_and_metric_scores 
#Maximum Metrics: Maximum metrics at their respective thresholds
#                        metric threshold    value idx
#1                       max f1  0.314699 0.641975 200
#2                       max f2  0.215203 0.795148 262
#3                 max f0point5  0.451965 0.669856  74
#4                 max accuracy  0.451965 0.707581  74
#5                max precision  0.998285 1.000000   0
#6                   max recall  0.215203 1.000000 262
#7              max specificity  0.998285 1.000000   0
#8             max absolute_mcc  0.451965 0.395147  74
#9   max min_per_class_accuracy  0.360174 0.652542 127
#10 max mean_per_class_accuracy  0.391279 0.683269  97

# Thresholds found by maximizing the metrics on the validation dataset
fit@model$validation_metrics@metrics$max_criteria_and_metric_scores 
#Maximum Metrics: Maximum metrics at their respective thresholds
#                        metric threshold    value idx
#1                       max f1  0.363477 0.607143  33
#2                       max f2  0.292342 0.785714  51
#3                 max f0point5  0.643382 0.725806   9
#4                 max accuracy  0.643382 0.774194   9
#5                max precision  0.985308 1.000000   0
#6                   max recall  0.292342 1.000000  51
#7              max specificity  0.985308 1.000000   0
#8             max absolute_mcc  0.643382 0.499659   9
#9   max min_per_class_accuracy  0.379602 0.650000  28
#10 max mean_per_class_accuracy  0.618286 0.702273  11

result[order(result$predict),]
#   predict          p0        p1
#5        0 0.703274569 0.2967254
#6        0 0.639763460 0.3602365
#13       0 0.689557497 0.3104425
#14       0 0.656764541 0.3432355
#15       0 0.696248328 0.3037517
#16       0 0.707069611 0.2929304
#18       0 0.692137408 0.3078626
#19       0 0.701482762 0.2985172
#20       0 0.705973644 0.2940264
#21       0 0.701156961 0.2988430
#22       0 0.671778898 0.3282211
#24       0 0.646735016 0.3532650
#26       0 0.646582708 0.3534173
#27       0 0.690402957 0.3095970
#32       0 0.649945017 0.3500550
#37       0 0.804937468 0.1950625
#40       0 0.717706731 0.2822933
#41       0 0.642094040 0.3579060
#1        1 0.364577068 0.6354229
#2        1 0.503432724 0.4965673
#3        1 0.406771233 0.5932288
#4        1 0.551801718 0.4481983
#7        1 0.339600779 0.6603992
#8        1 0.002978593 0.9970214
#9        1 0.378034417 0.6219656
#10       1 0.596298925 0.4037011
#11       1 0.635597359 0.3644026
#12       1 0.552662241 0.4473378
#17       1 0.615302107 0.3846979
#23       1 0.628906297 0.3710937
#25       1 0.600791894 0.3992081
#28       1 0.216571552 0.7834284
#29       1 0.559174924 0.4408251
#30       1 0.489514642 0.5104854
#31       1 0.623958696 0.3760413
#33       1 0.504691497 0.4953085
#34       1 0.582509462 0.4174905
#35       1 0.504136056 0.4958639
#36       1 0.463076505 0.5369235
#38       1 0.510908093 0.4890919
#39       1 0.469376828 0.5306232
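
Putting this together, here is a minimal sketch that re-derives the labels by hand. It reuses the fit and result objects from the question; the slot path mirrors the output above, and the expected values assume the same seed = 1234 splits.

# Recover the threshold that maximized F1 on the validation frame.
# (h2o.find_threshold_by_max_metric() applied to
# h2o.performance(fit, valid = TRUE) should return the same value.)
max_metrics <- as.data.frame(
  fit@model$validation_metrics@metrics$max_criteria_and_metric_scores)
f1_threshold <- max_metrics$threshold[max_metrics$metric == "max f1"]
f1_threshold
# [1] 0.363477

# h2o.predict() labels a row as class 1 when its p1 lies above this
# threshold; no p1 here equals the threshold exactly, so the boundary
# rule does not matter for this data.
manual_predict <- as.numeric(result$p1 > f1_threshold)
all(manual_predict == as.numeric(as.character(result$predict)))
# [1] TRUE

# The 14 "surprising" rows are exactly those with p1 between the
# F1-maximizing threshold and the intuitive 0.5 cutoff:
sum(result$p1 > f1_threshold & result$p1 < 0.5)
# [1] 14

If a plain 0.5 cutoff is preferred, one can simply threshold result$p1 manually, at the cost of a lower F1 score on the validation data.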
