Importance based variable reduction


Question

I am facing a difficulty with filtering out the least important variables in my model. I received a set of data with more than 4,000 variables, and I have been asked to reduce the number of variables going into the model.

I have already tried two approaches, and failed both times.

The first thing I tried was to manually check variable importance after modelling and, based on that, remove non-significant variables.

# reproducible example
library(mlr3)
library(dplyr)

data <- iris

# artificial class imbalance
data <- iris %>% 
  mutate(Species = as.factor(ifelse(Species == "virginica", "1", "0"))) 

Everything works fine when using a simple Learner:

# creating Task
task <- TaskClassif$new(id = "score", backend = data, target = "Species", positive = "1")

# creating Learner
lrn <- lrn("classif.xgboost") 

# setting scoring as prediction type 
lrn$predict_type = "prob"

lrn$train(task)
lrn$importance()

 Petal.Width Petal.Length 
  0.90606304   0.09393696 
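Given the importance vector above, one way to carry out the manual pruning step is to keep only the features above a cutoff and shrink the Task accordingly. A minimal sketch (the 0.05 threshold is an arbitrary example, not from the original post):

```r
# keep only features whose importance exceeds an example threshold
imp <- lrn$importance()
keep <- names(imp[imp > 0.05])

# Task$select() restricts the task to the chosen feature subset;
# a retrained model then only sees the surviving variables
task$select(keep)
task$feature_names
```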

The issue is that the data is highly imbalanced, so I decided to use a GraphLearner with a PipeOp operator to undersample the majority class, which is then passed to an AutoTuner:

I skipped some parts of the code that I believe are not important for this case, e.g. the search space, terminator, tuner, etc.

# undersampling
po_under <- po("classbalancing",
               id = "undersample", adjust = "major",
               reference = "major", shuffle = FALSE, ratio = 1 / 2)

# combine learner with pipeline graph
lrn_under <- GraphLearner$new(po_under %>>% lrn)

# setting the AutoTuner
# (resample, measure, ps_under, terminator and tuner are defined elsewhere)
at <- AutoTuner$new(
  learner = lrn_under,
  resampling = resample,
  measure = measure,
  search_space = ps_under,
  terminator = terminator,
  tuner = tuner
)

at$train(task)

The problem right now is that, despite the importance property still being visible within at, $importance() is unavailable.

> at
<AutoTuner:undersample.classif.xgboost.tuned>
* Model: list
* Parameters: list()
* Packages: -
* Predict Type: prob
* Feature types: logical, integer, numeric, character, factor, ordered, POSIXct
* Properties: featureless, importance, missings, multiclass, oob_error, selected_features, twoclass, weights

So I decided to change my approach and try to add filtering into a Learner. And that's where I failed even more. I started by looking into the feature selection chapter of the mlr3book - https://mlr3book.mlr-org.com/fs.html. I tried to add importance = "impurity" to the Learner, just like in the book, but it yielded an error.

> lrn <- lrn("classif.xgboost", importance = "impurity") 
Błąd w poleceniu 'instance[[nn]] <- dots[[i]]':
  nie można zmienić wartości zablokowanego połączenia dla 'importance'

which basically translates to:

Error in 'instance[[nn]] <- dots[[i]]': cannot change value of locked binding for 'importance'

I also tried a workaround with PipeOp filtering, but it failed as well. I believe I won't be able to do it without importance = "impurity".

So my question is: is there a way to achieve what I am aiming for?

In addition, I would be grateful for an explanation of why filtering by importance is possible before modelling. Shouldn't it be based on the model result?

Answer

The reason why you can't access $importance of the at variable is that it is an AutoTuner, which does not directly offer variable importance and only "wraps" around the actual Learner being tuned.

The trained GraphLearner is saved inside your AutoTuner under $learner:

# get the trained GraphLearner, with tuned hyperparameters
graphlearner <- at$learner

This object also does not have $importance(). (Theoretically, a GraphLearner could contain more than one Learner, and then it wouldn't even know which importance to report!)

Getting the actual LearnerClassifXgboost object is a bit tedious, unfortunately, because of shortcomings in the "R6" object system used by mlr3:

  1. Get the untrained Learner object
  2. Get the trained state of the Learner and put it into that object

# get the untrained Learner
xgboostlearner <- graphlearner$graph$pipeops$classif.xgboost$learner

# put the trained model into the Learner
xgboostlearner$state <- graphlearner$model$classif.xgboost

Now the importance can be queried:

xgboostlearner$importance()


The example from the book that you link to does not work in your case because the book uses the ranger Learner, while you are using xgboost. importance = "impurity" is specific to ranger.
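As for the PipeOp-filtering route mentioned in the question, and why filtering can happen before modelling at all: mlr3filters can score features either from the data alone (e.g. information gain) or by fitting a fast auxiliary model used only for scoring, so no fitted main model is needed. A hedged sketch, assuming the mlr3filters, mlr3pipelines, FSelectorRcpp and ranger packages are installed; filter.nfeat = 2 is an arbitrary example value:

```r
library(mlr3)
library(mlr3filters)
library(mlr3pipelines)

# data-driven filter: scores features without fitting the main model
po_flt <- po("filter", filter = flt("information_gain"), filter.nfeat = 2)

# model-based alternative: fits an auxiliary ranger (which does accept
# importance = "impurity") just to score the features
po_imp <- po("filter",
             filter = flt("importance",
                          learner = lrn("classif.ranger", importance = "impurity")),
             filter.nfeat = 2)

# either filter can be chained in front of the xgboost learner
glrn <- GraphLearner$new(po_flt %>>% lrn("classif.xgboost"))
```

Because the filter derives its scores from the training data (or from a cheap auxiliary model), it can run before the xgboost model is fitted, which is what makes pre-modelling filtering possible.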

