R h2o.glm - issue with max_active_predictors


Problem description

I want to fit an h2o.glm model with a pre-defined maximum number of active predictors (a non-default value of the max_active_predictors parameter). Here is an example:

library(h2o)

set.seed(123)

par1 <- matrix(c(100, 200, 300, 400, 40, 30, 20, 10), 4, 2)
par2 <- c(1000, 2000, 3000, 4000)

coef <- c(0.5, -0.5, 1, -1, 1.5, -1.5, 2, -2)

mat <- as.data.frame(cbind(apply(par1, 1, function(x) rnorm(1000, mean = x[1], sd = x[2])),
                           sapply(par2, function(x) rpois(1000, lambda = x))))
mat$Y <- as.numeric(t(coef %*% t(mat)))

h2o.init(nthreads = -1)
mat_h2o <- as.h2o(mat, "mat.h2o")

glm_base <- h2o.glm(x = setdiff(colnames(mat), "Y"), 
                    y = "Y",
                    training_frame = mat_h2o,
                    solver = "IRLSM",
                    family = "gaussian",
                    link = "family_default",
                    alpha = 1,
                    lambda_search = TRUE,
                    nlambdas = 10)

summary(glm_base)

glm_restr <- h2o.glm(x = setdiff(colnames(mat), "Y"), 
                     y = "Y",
                     training_frame = mat_h2o,
                     solver = "IRLSM",
                     family = "gaussian",
                     link = "family_default",
                     alpha = 1,
                     lambda_search = TRUE,
                     nlambdas = 10,
                     max_active_predictors = 3)

summary(glm_restr)

The summary of glm_base looks exactly as I expect (eight non-zero predictors), but the summary of glm_restr is counter-intuitive (also eight non-zero predictors). How can I force the algorithm to restrict the complexity of the final model to the pre-defined number of variables?

Recommended answer

I think this is a bug. (Confirmed, see https://0xdata.atlassian.net/browse/PUBDEV-3455)

When I ran h2o.scoreHistory(glm_restr), I got:

Scoring History: 
            timestamp   duration iteration lambda predictors deviance_train
1 2016-09-21 09:25:29  0.000 sec         0  .46E2          4       9806.688
2 2016-09-21 09:25:29  0.052 sec         0  .17E2          7       1988.941
3 2016-09-21 09:25:29  0.100 sec         0   .6E1          9        294.884
4 2016-09-21 09:25:29  0.153 sec         0  .21E1          9         38.086
5 2016-09-21 09:25:29  0.203 sec         0  .77E0          9          4.919
6 2016-09-21 09:25:29  0.255 sec         0  .28E0          9          0.635
7 2016-09-21 09:25:30  0.307 sec         0   .1E0          9          0.082
8 2016-09-21 09:25:30  0.358 sec         0 .36E-1          9          0.011
9 2016-09-21 09:25:30  0.408 sec         0 .13E-1          9          0.001

That is, the first iteration of the lambda search, with a lambda value of 46, seems to have swept past 3 and gone straight to 4 predictors.
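The same check can be done programmatically. The following is only a minimal sketch: the lambda and predictors columns are typed in by hand from the scoring history above so that it runs without a cluster, and the names path, max_active and within_cap are illustrative. In a live session the table could probably be obtained with as.data.frame(h2o.scoreHistory(glm_restr)), but the lambda column may then arrive as text, which is an assumption worth verifying.

```r
# Lambda path copied by hand from the scoring history above.
# Note: the last rows report 9 although the frame has only 8 predictor
# columns, so this count possibly includes the intercept (an assumption).
path <- data.frame(
  lambda     = c(46, 17, 6, 2.1, 0.77, 0.28, 0.1, 0.036, 0.013),
  predictors = c(4, 7, 9, 9, 9, 9, 9, 9, 9)
)

max_active <- 3  # the cap requested via max_active_predictors

# Rows of the path whose reported predictor count respects the cap
within_cap <- path[path$predictors <= max_active, ]

nrow(within_cap)  # 0: no lambda on the path stays within the cap
max(path$lambda)  # 46: the strongest penalty the search tried
```

Since no row satisfies the cap, a model within it has to come from a lambda larger than any value the search visited.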

With that clue, I could get three predictors by skipping the lambda search and choosing a lambda of 50:

glm_L50 <- h2o.glm(x = setdiff(colnames(mat), "Y"),
                   y = "Y",
                   training_frame = mat_h2o,
                   solver = "IRLSM",
                   family = "gaussian",
                   link = "family_default",
                   alpha = 1,
                   lambda = 50)

Printing glm_L50 shows:

GLM Model: summary
    family     link         regularization number_of_predictors_total
1 gaussian identity Lasso (lambda = 50.0 )                          8
  number_of_active_predictors number_of_iterations training_frame
1                           3                    0        mat.h2o

Coefficients: glm coefficients
      names coefficients standardized_coefficients
1 Intercept  -998.311697              -3657.657068
2        V1     0.000000                  0.000000
3        V2     0.000000                  0.000000
4        V3     0.000000                  0.000000
5        V4     0.000000                  0.000000
6        V5     0.000000                  0.000000
7        V6    -0.389528                -17.453935
8        V7     1.014556                 53.969163
9        V8    -1.229969                -81.328717

H2ORegressionMetrics: glm
** Reported on training data. **

MSE:  10921.23
RMSE:  104.5047
MAE:  83.98198
RMSLE:  NaN
Mean Residual Deviance :  10921.23
R^2 :  0.6932398
Null Deviance :35601860
Null D.o.F. :999
Residual Deviance :10921233
Residual D.o.F. :996
AIC :12146.34
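Picking lambda = 50 by eye can be generalized into a small scan over candidate values. This is a rough sketch of that workaround, not a feature of h2o: count_active, first_lambda_within_cap and h2o_coefs are hypothetical helper names, the candidate lambdas are arbitrary, and the scan assumes that a stronger penalty does not add predictors back. The adapter reuses mat and mat_h2o from the session above and relies on h2o.coef returning a named vector with an "Intercept" entry.

```r
# Count non-zero coefficients, ignoring the intercept.
count_active <- function(coefs) {
  sum(coefs[names(coefs) != "Intercept"] != 0)
}

# Hypothetical helper (not part of h2o): try candidate lambdas from the
# weakest penalty to the strongest and return the first one whose fit
# stays within the cap, or NA if none does.
first_lambda_within_cap <- function(lambdas, max_active, coefs_for_lambda) {
  for (lam in sort(lambdas)) {
    if (count_active(coefs_for_lambda(lam)) <= max_active) return(lam)
  }
  NA_real_
}

# Adapter for the session above: fit a lasso at one fixed lambda and
# return its coefficients. Needs the running cluster and mat_h2o.
h2o_coefs <- function(lam) {
  fit <- h2o.glm(x = setdiff(colnames(mat), "Y"),
                 y = "Y",
                 training_frame = mat_h2o,
                 family = "gaussian",
                 alpha = 1,
                 lambda = lam,
                 lambda_search = FALSE)
  h2o.coef(fit)
}

# Example call (candidate values are arbitrary):
# first_lambda_within_cap(c(10, 25, 50, 100), max_active = 3, h2o_coefs)
```

Passing the fitting step in as a function keeps the scan independent of the cluster, so the counting logic can be checked on a plain named vector such as the glm_L50 coefficients printed above, where it gives 3.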
