我们应该如何解释H2O预测函数的结果? [英] How should we interpret the results of the H2O predict function?

查看:178
本文介绍了我们应该如何解释H2O预测函数的结果?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经训练并存储了随机森林二进制分类模型.现在,我正在尝试使用此模型来模拟处理新的(样本外)数据.我的Python(Anaconda 3.6)代码是:

I have trained and stored a random forest binary classification model. Now I'm trying to simulate processing new (out-of-sample) data with this model. My Python (Anaconda 3.6) code is:

import h2o
import pandas as pd
import sys

localH2O = h2o.init(ip = "localhost", port = 54321, max_mem_size = "8G", nthreads = -1)
h2o.remove_all()

model_path = "C:/sm/BottleRockets/rf_model/DRF_model_python_1501621766843_28117";
model = h2o.load_model(model_path)

new_data = h2o.import_file(path="C:/sm/BottleRockets/new_data.csv")
print(new_data.head(10))

predict = model.predict(new_data)  # predict returns a data frame
print(predict.describe())
predicted = predict[0,0]
probability = predict[0,2]  # probability the prediction is a "1"

print('prediction: ', predicted, ', probability: ', probability)

运行此代码时,我得到:

When I run this code I get:

>>> import h2o
>>> import pandas as pd
>>> import sys
>>> localH2O = h2o.init(ip = "localhost", port = 54321, max_mem_size = "8G", nthreads = -1)
Checking whether there is an H2O instance running at http://localhost:54321. connected.
--------------------------  ------------------------------
H2O cluster uptime:         22 hours 22 mins
H2O cluster version:        3.10.5.4
H2O cluster version age:    18 days
H2O cluster name:           H2O_from_python_Charles_0fqq0c
H2O cluster total nodes:    1
H2O cluster free memory:    6.790 Gb
H2O cluster total cores:    8
H2O cluster allowed cores:  8
H2O cluster status:         locked, healthy
H2O connection url:         http://localhost:54321
H2O connection proxy:
H2O internal security:      False
Python version:             3.6.1 final
--------------------------  ------------------------------
>>> h2o.remove_all()
>>> model_path = "C:/sm/BottleRockets/rf_model/DRF_model_python_1501621766843_28117";
>>> model = h2o.load_model(model_path)
>>> new_data = h2o.import_file(path="C:/sm/BottleRockets/new_data.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%
>>> print(new_data.head(10))
  BoxRatio    Thrust    Velocity    OnBalRun    vwapGain
----------  --------  ----------  ----------  ----------
     1.502    55.044        0.38          37       0.845

[1 row x 5 columns]

>>> predict = model.predict(new_data)  # predict returns a data frame

drf prediction progress: |████████████████████████████████████████████████| 100%
>>> print(predict.describe())
Rows:1
Cols:3


         predict    p0                  p1
-------  ---------  ------------------  -------------------
type     enum       real                real
mins                0.8849431818181818  0.11505681818181818
mean                0.8849431818181818  0.11505681818181818
maxs                0.8849431818181818  0.11505681818181818
sigma               0.0                 0.0
zeros               0                   0
missing  0          0                   0
0        1          0.8849431818181818  0.11505681818181818
None
>>> predicted = predict[0,0]
>>> probability = predict[0,2]  # probability the prediction is a "1"
>>> print('prediction: ', predicted, ', probability: ', probability)
prediction:  1 , probability:  0.11505681818181818
>>> 

我对预测"数据帧的内容感到困惑.请告诉我标记为"p0"和"p1"的列中的数字是什么意思.我希望它们是概率,并且正如您从我的代码可以看到的那样,我正在尝试获得预测的分类(0或1)以及该分类正确的可能性.我的代码可以正确执行此操作吗?

I am confused by the contents of the "predict" data frame. Please tell me what the numbers in the columns labeled "p0" and "p1" mean. I hope they are probabilities, and as you can see by my code, I am trying to get the predicted classification (0 or 1) and a probability that this classification is correct. Does my code correctly do that?

任何评论将不胜感激. 查尔斯

Any comments will be greatly appreciated. Charles

推荐答案

p0是选择0类的可能性(介于0和1之间).

p0 is the probability (between 0 and 1) that class 0 is chosen.

p1是选择1类的概率(介于0和1之间).

p1 is the probability (between 0 and 1) that class 1 is chosen.

要记住的是,预测"是通过将阈值应用于p1来进行的.根据您要减少误报还是误报来选择该阈值点.不只是0.5.

The thing to keep in mind is that the "prediction" is made by applying a threshold to p1. That threshold point is chosen depending on whether you want to reduce false positives or false negatives. It's not just 0.5.

为预测"选择的阈值为max-F1.但是您可以自己提取p1并以任意方式对其设置阈值.

The threshold chosen for "the prediction" is max-F1. But you can extract out p1 yourself and threshold it any way you like.

这篇关于我们应该如何解释H2O预测函数的结果?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆