H2O GLM: interacting only certain predictors

Question

I'm interested in creating interaction terms in h2o.glm(), but I do not want to generate all pairwise interactions. For example, in the mtcars dataset I want to interact 'mpg' with each of the other variables ('cyl', 'hp', and 'disp'), but I don't want those other variables to interact with each other (so I don't want disp_hp or disp_cyl).

How should I best approach this problem using the interactions = interactions_list parameter in h2o.glm()?

Thanks.

Recommended answer

According to ?h2o.glm, the interactions= parameter takes:

A list of predictor column indices to interact. All pairwise combinations will be computed for the list.

You do not want all pairwise combinations, only specific ones.

Unfortunately, the R H2O API does not provide a formula interface. If it did, an arbitrary set of interactions could be specified programmatically, as in a vanilla R glm (see footnote 1).

One solution is to include all pairwise combinations in the model and then suppress those you do not want by setting the betas equal to 0.

According to the GLM documentation, beta_constraints= is used to:

Specify a dataset to use beta constraints. The selected frame is used to constrain the coefficient vector to provide upper and lower bounds. The dataset must contain a names column with valid coefficient names.

According to the H2O glossary, the format for beta_constraints is:

A data.frame or H2OParsedData object with the columns ["names", "lower_bounds","upper_bounds", "beta_given"], where each row corresponds to a predictor in the GLM. "names" contains the predictor names, "lower_bounds" and "upper_bounds" are the lower and upper bounds of beta, and "beta_given" is some supplied starting values for beta.

Now we know how to fill out our beta_constraints data frame except for how to format the interaction term names. The doc on interactions doesn't tell us what to expect. So let's just run an example with interactions through H2O and see what the interactions get named.

library('h2o')
remoteH2O <- h2o.init(ip='xxx.xx.xx.xxx', startH2O=FALSE)

data(mtcars)

df1 <- as.h2o(mtcars, destination_frame = 'demo_mtcars')

target <- 'wt'
predictors <- c('mpg','cyl','hp','disp')

glm1 <- h2o.glm(x = predictors,
                y = target,
                training_frame = 'demo_mtcars',
                model_id = 'demo_glm',
                lambda = 0, # disable regularization, but your use case may vary
                standardize = FALSE, # we want to see the raw parameters, but your use case may vary
                interactions = predictors # create all interactions
                )
print(glm1) # output includes:
# Coefficients: glm coefficients
#        names coefficients
# 1  Intercept     4.336269
# 2    mpg_cyl     0.019558
# 3     mpg_hp     0.000156
# ..

So it looks like the interaction terms are getting named like v1_v2.

So let's name all the interaction terms we want to suppress, using setdiff() against the terms we want to keep.

library(tidyr)
intx_terms_keep <- # see footnote 1 for explanation
  expand.grid(c('mpg'),c('cyl','hp','disp')) %>%
    unite(intx, Var1, Var2, sep='_') %>% unlist()

intx_terms_suppress <- setdiff( # suppress all interactions minus those we wish to keep
                             combn(predictors,2,FUN=paste,collapse='_'), 
                             intx_terms_keep
                            )
constraints <- data.frame(names=intx_terms_suppress, 
                          lower_bounds=0, 
                          upper_bounds=0, 
                          beta_given=0)

glm2 <- h2o.glm(x = predictors,
                y = target,
                training_frame = 'demo_mtcars',
                model_id = 'demo_glm',
                lambda = 0,
                standardize = FALSE, 
                interactions = predictors, # create all interactions
                beta_constraints = constraints
)
print(glm2) # output includes:
# Coefficients: glm coefficients
#        names coefficients
# 1  Intercept     3.405154
# 2    mpg_cyl    -0.012740
# 3     mpg_hp    -0.000250
# 4   mpg_disp     0.000066
# 5     cyl_hp     0.000000
# 6   cyl_disp     0.000000
# 7    hp_disp     0.000000
# 8        mpg    -0.018981
# 9        cyl     0.168820
# 10      disp     0.004070
# 11        hp     0.000501

As you can see, only the desired interaction terms have non-zero coefficients; the rest are constrained to zero and effectively ignored. However, because they are still terms in the model, they may count towards degrees of freedom and may affect some metrics (e.g., adjusted R-squared).

As @Darren Cook mentioned, another solution would be to pre-create the interactions as variables in the training dataset.

This approach would ensure that the unwanted interactions neither count towards degrees of freedom nor affect your adjusted R-squared.
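A minimal sketch of that pre-creation step, in base R. It assumes all the variables involved are numeric (as they are in mtcars), so each interaction is simply the product of two columns. The "mpg_x_" prefix, the train data frame, and the names df2 and glm3 are hypothetical, chosen here only for illustration. The H2O calls are left commented out because they need a running cluster.

```r
# Build only the mpg-by-X interaction columns by hand (numeric variables assumed).
data(mtcars)

target     <- "wt"
predictors <- c("mpg", "cyl", "hp", "disp")
partners   <- setdiff(predictors, "mpg")   # variables to interact with mpg

train <- mtcars[, c(target, predictors)]
for (p in partners) {
  # Hypothetical naming scheme "mpg_x_<name>"; any unique column name would do.
  train[[paste0("mpg_x_", p)]] <- train[["mpg"]] * train[[p]]
}

intx_cols      <- paste0("mpg_x_", partners)
all_predictors <- c(predictors, intx_cols)

# Assumed follow-up, not run here (requires an H2O cluster):
# df2  <- as.h2o(train, destination_frame = "demo_mtcars_intx")
# glm3 <- h2o.glm(x = all_predictors, y = target, training_frame = df2,
#                 lambda = 0, standardize = FALSE)
```

Because the interaction columns are ordinary predictors here, neither interactions= nor beta_constraints= is needed in the model call.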

Footnote 1: In a vanilla R glm(), which allows the formula interface, I would use expand.grid to create a string of interaction terms and include it in the formula.

Pass expand.grid two vectors, v1 and v2: every term in v1 is interacted with every term in v2.

To use your example, you want to interact mpg with cyl, hp, and disp:

library(tidyr)
intx_term_string <- 
  expand.grid(c('mpg'),c('cyl','hp','disp')) %>%
    unite(intx, Var1, Var2, sep=':') %>% apply(2, paste, collapse='+')

This gives you a string of interaction terms like "mpg:cyl+mpg:hp+mpg:disp" that you can paste into a string of other predictors (possibly using paste-collapse) and convert with as.formula().
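As a rough sketch of that last step, the snippet below assembles the formula and fits a plain glm() on mtcars. To stay self-contained it rebuilds the interaction string with base R paste() rather than tidyr. The names main_effects, rhs, fml, and fit are illustrative, and the Gaussian family is an assumption made to match the continuous target wt.

```r
data(mtcars)

main_effects <- c("mpg", "cyl", "hp", "disp")

# Same string as the expand.grid/unite pipeline produces: mpg crossed with each partner.
intx_term_string <- paste(paste("mpg", c("cyl", "hp", "disp"), sep = ":"),
                          collapse = "+")

# Paste main effects and interaction terms into one right-hand side.
rhs <- paste(c(main_effects, intx_term_string), collapse = " + ")
fml <- as.formula(paste("wt ~", rhs))

# Fit with only the requested interactions; no cyl:hp, cyl:disp, or hp:disp terms.
fit <- glm(fml, data = mtcars, family = gaussian())
print(names(coef(fit)))
```

Since the unwanted interactions never enter the formula, they cannot contribute to the model's degrees of freedom.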
