H2O R API: retrieving optimal model from grid search

Problem description

I'm using the h2o package (v 3.6.0) in R, and I've built a grid search model. Now, I'm trying to access the model which minimizes MSE on the validation set. In python's sklearn, this is easily achievable when using RandomizedSearchCV:

## Pseudo code:
grid = RandomizedSearchCV(model, params, n_iter = 5)
grid.fit(X)
best = grid.best_estimator_

This, unfortunately, does not prove as straightforward in h2o. Here's an example you can recreate:

library(h2o)
## assume you got h2o initialized...

X <- as.h2o(iris[1:100,]) # Note: only using top two classes for example 
grid <- h2o.grid(
    algorithm = 'gbm',
    x = names(X[,1:4]),
    y = 'Species',
    training_frame = X,
    hyper_params = list(
        distribution = 'bernoulli',
        ntrees = c(25,50)
    )
)

Viewing grid prints a wealth of information, including this portion:

> grid
ntrees distribution status_ok                                                                 model_ids
 50    bernoulli        OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_1
 25    bernoulli        OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_0

With a bit of digging, you can access each individual model and view every metric imaginable:

> h2o.getModel(grid@model_ids[[1]])
H2OBinomialModel: gbm
Model ID:  Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_18_model_1 
Model Summary: 
  number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
1              50                4387         1         1    1.00000          2          2     2.00000


H2OBinomialMetrics: gbm
** Reported on training data. **

MSE:  1.056927e-05
R^2:  0.9999577
LogLoss:  0.003256338
AUC:  1
Gini:  1

Confusion Matrix for F1-optimal threshold:
           setosa versicolor    Error    Rate
setosa         50          0 0.000000   =0/50
versicolor      0         50 0.000000   =0/50
Totals         50         50 0.000000  =0/100

Maximum Metrics: Maximum metrics at their respective thresholds
                      metric threshold    value idx
1                     max f1  0.996749 1.000000   0
2                     max f2  0.996749 1.000000   0
3               max f0point5  0.996749 1.000000   0
4               max accuracy  0.996749 1.000000   0
5              max precision  0.996749 1.000000   0
6           max absolute_MCC  0.996749 1.000000   0
7 max min_per_class_accuracy  0.996749 1.000000   0

And with a lot of digging, you can finally get to this:

> h2o.getModel(grid@model_ids[[1]])@model$training_metrics@metrics$MSE
[1] 1.056927e-05

This seems like a lot of kludgey work to get down to a metric that ought to be top-level for model selection. In my situation, I've got a grid with hundreds of models, and my current, hacky solution just doesn't seem very "R-esque":

model_select_ <- function(grid) {
  model_ids <- grid@model_ids
  min = Inf
  best_model = NULL

  for(model_id in model_ids) {
    model <- h2o.getModel(model_id)
    mse <- model@model$training_metrics@metrics$MSE
    if(mse < min) {
      min <- mse
      best_model <- model
    }
  }

  best_model
}

This seems like overkill for something that is so core to the practice of machine learning, and it just strikes me as odd that h2o would not have a "cleaner" method of extracting the optimal model, or at least model metrics.

Am I missing something? Is there no "out of the box" method for selecting the best model?

Recommended answer

Yes, there is an easy way to extract the "top" model of an H2O grid search. There are also utility functions that will extract all the model metrics (e.g. h2o.mse) that you have been trying to access. Examples of how to do these things can be found in the h2o-r/demos and h2o-py/demos subfolders on the h2o-3 GitHub repo.
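
For instance, h2o.mse can stand in for the manual slot access shown in the question. The following is only a minimal sketch: it assumes an H2O cluster is running and that grid is the object returned by h2o.grid, and the helper name grid_mse is hypothetical, not part of the h2o package.

library(h2o)

# Hypothetical helper: collect one MSE value per model in a grid.
# `grid` is assumed to be the object returned by h2o.grid().
# Set valid = TRUE only if the grid was trained with a validation_frame.
grid_mse <- function(grid, valid = FALSE) {
  ids <- unlist(grid@model_ids)
  # vapply keeps the model IDs as names on the returned numeric vector
  vapply(ids, function(id) h2o.mse(h2o.getModel(id), valid = valid), numeric(1))
}

# Usage, given a trained grid:
# mse <- grid_mse(grid)
# best_model <- h2o.getModel(names(which.min(mse)))

With no validation frame, as in the question, this ranks models by training MSE, which mirrors the loop in the question but is not a sound basis for model selection.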

Since you are using R, here is a relevant code example that includes a grid search, with sorted results. You can also find how to access this information in the R documentation for the h2o.getGrid function.

Print out the auc for all of the models, sorted by validation AUC:

auc_table <- h2o.getGrid(grid_id = "eeg_demo_gbm_grid", sort_by = "auc", decreasing = TRUE)
print(auc_table)

Here is an example of the output:

H2O Grid Details
================

Grid ID: eeg_demo_gbm_grid 
Used hyper parameters: 
  -  ntrees 
  -  max_depth 
  -  learn_rate 
Number of models: 18 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by decreasing auc
   ntrees max_depth learn_rate                  model_ids               auc
1     100         5        0.2 eeg_demo_gbm_grid_model_17 0.967771493797284
2      50         5        0.2 eeg_demo_gbm_grid_model_16 0.949609591795923
3     100         5        0.1  eeg_demo_gbm_grid_model_8  0.94941792664595
4      50         5        0.1  eeg_demo_gbm_grid_model_7 0.922075196552274
5     100         3        0.2 eeg_demo_gbm_grid_model_14 0.913785959685157
6      50         3        0.2 eeg_demo_gbm_grid_model_13 0.887706691652792
7     100         3        0.1  eeg_demo_gbm_grid_model_5 0.884064379717198
8       5         5        0.2 eeg_demo_gbm_grid_model_15 0.851187402678818
9      50         3        0.1  eeg_demo_gbm_grid_model_4 0.848921799270639
10      5         5        0.1  eeg_demo_gbm_grid_model_6 0.825662907513139
11    100         2        0.2 eeg_demo_gbm_grid_model_11 0.812030639460551
12     50         2        0.2 eeg_demo_gbm_grid_model_10 0.785379521713437
13    100         2        0.1  eeg_demo_gbm_grid_model_2  0.78299280750123
14      5         3        0.2 eeg_demo_gbm_grid_model_12 0.774673686150002
15     50         2        0.1  eeg_demo_gbm_grid_model_1 0.754834657912535
16      5         3        0.1  eeg_demo_gbm_grid_model_3 0.749285131682721
17      5         2        0.2  eeg_demo_gbm_grid_model_9 0.692702793188135
18      5         2        0.1  eeg_demo_gbm_grid_model_0 0.676144542037133

The top row in the table contains the model with the best AUC, so below we can grab that model and extract the validation AUC:

best_model <- h2o.getModel(auc_table@model_ids[[1]])
h2o.auc(best_model, valid = TRUE)

In order for the h2o.getGrid function to be able to sort by a metric on the validation set, you need to actually pass the h2o.grid function a validation_frame. In your example above, you did not pass a validation_frame, so you can't evaluate the models in the grid on the validation set.
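
Applied to the iris example from the question, passing a validation_frame and then sorting by MSE might look like the sketch below. The 0.7 split ratio, the seed, and the grid ID iris_gbm_grid are arbitrary choices made for illustration, and sorting by "mse" with decreasing = FALSE is assumed to behave like the "auc" sort shown above.

library(h2o)
h2o.init()  # starts or connects to a local H2O cluster

# Two-class subset of iris, as in the question; droplevels() removes the unused class
df <- as.h2o(droplevels(iris[1:100, ]))

# Hold out part of the data for validation (ratio and seed are arbitrary)
splits <- h2o.splitFrame(df, ratios = 0.7, seed = 1)
train <- splits[[1]]
valid <- splits[[2]]

grid <- h2o.grid(
  algorithm = "gbm",
  grid_id = "iris_gbm_grid",        # illustrative ID
  x = names(df)[1:4],
  y = "Species",
  training_frame = train,
  validation_frame = valid,         # needed for validation metrics
  distribution = "bernoulli",
  hyper_params = list(ntrees = c(25, 50))
)

# Lowest MSE first, so the best model is in the first row
mse_table <- h2o.getGrid(grid_id = "iris_gbm_grid", sort_by = "mse", decreasing = FALSE)
best_model <- h2o.getModel(mse_table@model_ids[[1]])
best_mse <- h2o.mse(best_model, valid = TRUE)
print(best_mse)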
