H2O R API: retrieving optimal model from grid search
Problem description
I'm using the h2o package (v 3.6.0) in R, and I've built a grid search model. Now I'm trying to access the model that minimizes MSE on the validation set. In Python's sklearn, this is easily achievable when using RandomizedSearchCV:
## Pseudo code:
grid = RandomizedSearchCV(model, params, n_iter = 5)
grid.fit(X)
best = grid.best_estimator_
This, unfortunately, does not prove as straightforward in h2o. Here's an example you can recreate:
library(h2o)
## assume you got h2o initialized...
X <- as.h2o(iris[1:100,]) # Note: only using top two classes for example
grid <- h2o.grid(
  algorithm = 'gbm',
  x = names(X[,1:4]),
  y = 'Species',
  training_frame = X,
  hyper_params = list(
    distribution = 'bernoulli',
    ntrees = c(25,50)
  )
)
Viewing grid prints a wealth of information, including this portion:
> grid
ntrees distribution status_ok model_ids
50 bernoulli OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_1
25 bernoulli OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_0
With a bit of digging, you can access each individual model and view every metric imaginable:
> h2o.getModel(grid@model_ids[[1]])
H2OBinomialModel: gbm
Model ID: Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_18_model_1
Model Summary:
number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
1 50 4387 1 1 1.00000 2 2 2.00000
H2OBinomialMetrics: gbm
** Reported on training data. **
MSE: 1.056927e-05
R^2: 0.9999577
LogLoss: 0.003256338
AUC: 1
Gini: 1
Confusion Matrix for F1-optimal threshold:
setosa versicolor Error Rate
setosa 50 0 0.000000 =0/50
versicolor 0 50 0.000000 =0/50
Totals 50 50 0.000000 =0/100
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.996749 1.000000 0
2 max f2 0.996749 1.000000 0
3 max f0point5 0.996749 1.000000 0
4 max accuracy 0.996749 1.000000 0
5 max precision 0.996749 1.000000 0
6 max absolute_MCC 0.996749 1.000000 0
7 max min_per_class_accuracy 0.996749 1.000000 0
And with a lot of digging, you can finally get to this:
> h2o.getModel(grid@model_ids[[1]])@model$training_metrics@metrics$MSE
[1] 1.056927e-05
This seems like a lot of kludgey work to get down to a metric that ought to be top-level for model selection. In my situation, I've got a grid with hundreds of models, and my current, hacky solution just doesn't seem very "R-esque":
model_select_ <- function(grid) {
  model_ids <- grid@model_ids
  min = Inf
  best_model = NULL
  for(model_id in model_ids) {
    model <- h2o.getModel(model_id)
    mse <- model@model$training_metrics@metrics$MSE
    if(mse < min) {
      min <- mse
      best_model <- model
    }
  }
  best_model
}
This seems like overkill for something that is so core to the practice of machine learning, and it strikes me as odd that h2o would not have a "cleaner" method of extracting the optimal model, or at least the model metrics.
Am I missing something? Is there no "out of the box" method for selecting the best model?
Recommended answer
Yes, there is an easy way to extract the "top" model of an H2O grid search. There are also utility functions (e.g. h2o.mse) that extract the model metrics you have been trying to access. Examples of how to do these things can be found in the h2o-r/demos and h2o-py/demos subfolders of the h2o-3 GitHub repo.
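As a rough sketch of the utility-function route, the loop from the question can be reduced to a few lines with h2o.mse. This assumes a local H2O cluster can be started with h2o.init(); the grid name iris_gbm_grid is a hypothetical label chosen for illustration, and the MSE used here is the training MSE, since no validation frame is supplied.

```r
library(h2o)
h2o.init()  # assumes a local H2O cluster can be started

# Same toy data as in the question: the first two iris classes
X <- as.h2o(droplevels(iris[1:100, ]))

grid <- h2o.grid(
  algorithm = "gbm",
  grid_id = "iris_gbm_grid",  # hypothetical grid name
  x = names(iris)[1:4],
  y = "Species",
  training_frame = X,
  hyper_params = list(ntrees = c(25, 50))
)

# Collect the training MSE of every model via the h2o.mse accessor
model_ids <- unlist(grid@model_ids)
mses <- sapply(model_ids, function(id) h2o.mse(h2o.getModel(id)))

# Fetch the model with the smallest training MSE
best_model <- h2o.getModel(model_ids[[which.min(mses)]])
```

This only replaces the manual slot access; letting H2O sort the grid itself, as described next, is usually simpler.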
Since you are using R, here is a relevant code example that includes a grid search with sorted results. The R documentation for the h2o.getGrid function also explains how to access this information.
Print the AUC for all of the models, sorted by validation AUC:
auc_table <- h2o.getGrid(grid_id = "eeg_demo_gbm_grid", sort_by = "auc", decreasing = TRUE)
print(auc_table)
Here is an example of the output:
H2O Grid Details
================
Grid ID: eeg_demo_gbm_grid
Used hyper parameters:
- ntrees
- max_depth
- learn_rate
Number of models: 18
Number of failed models: 0
Hyper-Parameter Search Summary: ordered by decreasing auc
ntrees max_depth learn_rate model_ids auc
1 100 5 0.2 eeg_demo_gbm_grid_model_17 0.967771493797284
2 50 5 0.2 eeg_demo_gbm_grid_model_16 0.949609591795923
3 100 5 0.1 eeg_demo_gbm_grid_model_8 0.94941792664595
4 50 5 0.1 eeg_demo_gbm_grid_model_7 0.922075196552274
5 100 3 0.2 eeg_demo_gbm_grid_model_14 0.913785959685157
6 50 3 0.2 eeg_demo_gbm_grid_model_13 0.887706691652792
7 100 3 0.1 eeg_demo_gbm_grid_model_5 0.884064379717198
8 5 5 0.2 eeg_demo_gbm_grid_model_15 0.851187402678818
9 50 3 0.1 eeg_demo_gbm_grid_model_4 0.848921799270639
10 5 5 0.1 eeg_demo_gbm_grid_model_6 0.825662907513139
11 100 2 0.2 eeg_demo_gbm_grid_model_11 0.812030639460551
12 50 2 0.2 eeg_demo_gbm_grid_model_10 0.785379521713437
13 100 2 0.1 eeg_demo_gbm_grid_model_2 0.78299280750123
14 5 3 0.2 eeg_demo_gbm_grid_model_12 0.774673686150002
15 50 2 0.1 eeg_demo_gbm_grid_model_1 0.754834657912535
16 5 3 0.1 eeg_demo_gbm_grid_model_3 0.749285131682721
17 5 2 0.2 eeg_demo_gbm_grid_model_9 0.692702793188135
18 5 2 0.1 eeg_demo_gbm_grid_model_0 0.676144542037133
The top row in the table contains the model with the best AUC, so below we can grab that model and extract the validation AUC:
best_model <- h2o.getModel(auc_table@model_ids[[1]])
h2o.auc(best_model, valid = TRUE)
For the h2o.getGrid function to be able to sort by a metric on the validation set, you need to actually pass a validation_frame to the h2o.grid function. In your example above, you did not pass a validation_frame, so you can't evaluate the models in the grid on the validation set.
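To make the validation_frame point concrete, here is a minimal sketch that adapts the iris example from the question. The 75/25 split, the seed, and the grid name iris_gbm_valid_grid are assumptions made for illustration, and it assumes the installed h2o version accepts sort_by = "mse" in h2o.getGrid.

```r
library(h2o)
h2o.init()  # assumes a local H2O cluster can be started

# Two-class iris subset, split into training and validation frames
X <- as.h2o(droplevels(iris[1:100, ]))
splits <- h2o.splitFrame(X, ratios = 0.75, seed = 1)  # assumed split
train <- splits[[1]]
valid <- splits[[2]]

grid <- h2o.grid(
  algorithm = "gbm",
  grid_id = "iris_gbm_valid_grid",  # hypothetical grid name
  x = names(iris)[1:4],
  y = "Species",
  training_frame = train,
  validation_frame = valid,         # lets the grid report validation metrics
  hyper_params = list(ntrees = c(25, 50))
)

# Sort ascending so the lowest-MSE model comes first
sorted_grid <- h2o.getGrid(
  grid_id = "iris_gbm_valid_grid",
  sort_by = "mse",
  decreasing = FALSE
)

# The first model ID is the best model under this ordering
best_model <- h2o.getModel(sorted_grid@model_ids[[1]])
h2o.mse(best_model, valid = TRUE)
```

Because MSE is a lower-is-better metric, decreasing = FALSE is used here, whereas the AUC example above sorts with decreasing = TRUE.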