How to get different Variable Importance for each class in a binary h2o GBM in R?

Problem description

I'm exploring the use of a GBM with h2o for a classification problem, as a replacement for a logistic regression (GLM). The non-linearity and interactions in my data make me think a GBM is more suitable.

I've run a baseline GBM (see below) and compared its AUC against the AUC of the logistic regression. The GBM performs much better.

In a classic linear logistic regression, one can see the direction and effect of each of the predictors (x) on the outcome variable (y).

Now, I would like to evaluate the variable importance of the estimated GBM in the same way.

How does one obtain the variable importance for each of the two classes?

I know that variable importance is not the same as the estimated coefficient in a logistic regression, but it would help me understand which predictor impacts which class.

Others have asked similar questions, but the answers provided won't work for the H2O object.

Any help is much appreciated.

example.gbm <- h2o.gbm(
  x = c("list of predictors"), 
  y = "binary response variable", 
  training_frame = data, 
  max_runtime_secs = 1800, 
  nfolds=5,
  stopping_metric = "AUC")

Recommended answer

AFAIS, the more powerful a machine learning method is, the more complex it is to explain what is going on beneath it.

The advantages of the GBM method (as you mentioned already) also make the model harder to understand. This is especially true for numeric variables, because a GBM may treat different value ranges of the same variable differently: some ranges may have a positive impact whereas others have a negative one.

For a GLM with no interactions specified, the effect of a numeric variable is monotonic, so you can examine whether its impact is positive or negative.

Since a single overall view is difficult to obtain, is there any way to analyse the model? There are 2 methods we can start with:

h2o provides h2o.partialPlot, which gives the partial (i.e. marginal) effect of each variable on the predicted response:

library(h2o)
h2o.init()

# Load the prostate example data that ships with the h2o package
prostate.path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.uploadFile(path = prostate.path, destination_frame = "prostate.hex")

# Treat the binary response and RACE as categorical
prostate.hex[, "CAPSULE"] <- as.factor(prostate.hex[, "CAPSULE"])
prostate.hex[, "RACE"] <- as.factor(prostate.hex[, "RACE"])

# Fit a small GBM on two predictors
prostate.gbm <- h2o.gbm(x = c("AGE", "RACE"),
                        y = "CAPSULE",
                        training_frame = prostate.hex,
                        ntrees = 10,
                        max_depth = 5,
                        learn_rate = 0.1)

# Plot the partial dependence of the prediction on AGE
h2o.partialPlot(object = prostate.gbm, data = prostate.hex, cols = "AGE")
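
To compare this with h2o's built-in importance table, the sketch below refits the same toy model and prints both. Note that h2o.varimp returns one overall ranking for the model rather than one per class, so the direction of an effect has to be read from the partial dependence values. For a binomial model, I am assuming that the mean_response column is the average predicted probability of the second factor level, and that a single column with plot = FALSE gives a single table. Both are worth verifying against your h2o version.

```r
library(h2o)
h2o.init()

# Same toy setup as the example above (prostate data shipped with h2o).
prostate.path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.uploadFile(path = prostate.path)
prostate.hex[, "CAPSULE"] <- as.factor(prostate.hex[, "CAPSULE"])
prostate.hex[, "RACE"] <- as.factor(prostate.hex[, "RACE"])
prostate.gbm <- h2o.gbm(x = c("AGE", "RACE"), y = "CAPSULE",
                        training_frame = prostate.hex,
                        ntrees = 10, max_depth = 5, learn_rate = 0.1)

# Overall importance of each predictor (one ranking for the whole model,
# with no sign and no per-class breakdown).
vi <- h2o.varimp(prostate.gbm)
print(vi)

# Partial dependence as a table instead of a plot: one row per AGE grid
# point, so you can see where the mean prediction rises or falls.
pd <- h2o.partialPlot(object = prostate.gbm, data = prostate.hex,
                      cols = "AGE", plot = FALSE)
print(pd)
```

Reading the table row by row shows over which AGE ranges the average prediction moves up or down, which is the closest analogue to the sign of a GLM coefficient.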

The LIME package [https://github.com/thomasp85/lime] can show the contribution of each variable for individual observations. Luckily, this R package already supports h2o.
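
A minimal sketch of how LIME could be applied to the same toy model, assuming the lime package is installed and that your version supports H2O models directly. The choice of the first four rows and the settings n_labels = 2 and n_features = 2 are illustrative only. With n_labels = 2, both classes are explained for each observation, and the sign of feature_weight indicates whether a feature pushes towards or away from that class.

```r
library(h2o)
library(lime)
h2o.init()

# Same toy setup as the partial dependence example.
prostate.path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.uploadFile(path = prostate.path)
prostate.hex[, "CAPSULE"] <- as.factor(prostate.hex[, "CAPSULE"])
prostate.hex[, "RACE"] <- as.factor(prostate.hex[, "RACE"])
predictors <- c("AGE", "RACE")
prostate.gbm <- h2o.gbm(x = predictors, y = "CAPSULE",
                        training_frame = prostate.hex,
                        ntrees = 10, max_depth = 5, learn_rate = 0.1)

# lime works on ordinary data frames, so pull the predictors out of h2o.
train.df <- as.data.frame(prostate.hex[, predictors])

# Build the explainer from the training data and the fitted model.
explainer <- lime(train.df, prostate.gbm)

# Explain the first four observations for both classes.
# Label names are taken from the model's prediction columns.
explanation <- explain(train.df[1:4, ], explainer,
                       n_labels = 2, n_features = 2)

print(explanation[, c("case", "label", "feature", "feature_weight")])
```

Calling plot_features(explanation) draws the same weights as a bar chart per observation and label.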
