Difference between varImp (caret) and importance (randomForest) for Random Forest

Question

I do not understand the difference between the varImp function (caret package) and the importance function (randomForest package) for a Random Forest model.

I fitted a simple RF classification model, and when computing variable importance I found that the "ranking" of the predictors was not the same for the two functions.

Here is my code:

rfImp <- randomForest(Origin ~ ., data = TAll_CS,
                       ntree = 2000,
                       importance = TRUE)

importance(rfImp)

                                 BREAST       LUNG MeanDecreaseAccuracy MeanDecreaseGini
Energy_GLCM_R1SC4NG3        -1.44116806  2.8918537            1.0929302        0.3712622
Contrast_GLCM_R1SC4NG3      -2.61146974  1.5848150           -0.4455327        0.2446930
Entropy_GLCM_R1SC4NG3       -3.42017102  3.8839464            0.9779201        0.4170445
...

varImp(rfImp)
                                 BREAST        LUNG
Energy_GLCM_R1SC4NG3         0.72534283  0.72534283
Contrast_GLCM_R1SC4NG3      -0.51332737 -0.51332737
Entropy_GLCM_R1SC4NG3        0.23188771  0.23188771
...

I thought they used the same "algorithm", but now I am not sure.

Edit

To reproduce the problem, the ionosphere dataset (kknn package) can be used:

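library(randomForest)  # randomForest(), importance()
library(caret)         # varImp()
library(kknn)          # ionosphere dataset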
data(ionosphere)
rfImp <- randomForest(class ~ ., data = ionosphere[,3:35],
                       ntree = 2000,
                       importance = TRUE)
importance(rfImp)
             b        g MeanDecreaseAccuracy MeanDecreaseGini
V3  21.3106205 42.23040             42.16524        15.770711
V4  10.9819574 28.55418             29.28955         6.431929
V5  30.8473944 44.99180             46.64411        22.868543
V6  11.1880372 33.01009             33.18346         6.999027
V7  13.3511887 32.22212             32.66688        14.100210
V8  11.8883317 32.41844             33.03005         7.243705
V9  -0.5020035 19.69505             19.54399         2.501567
V10 -2.9051578 22.24136             20.91442         2.953552
V11 -3.9585608 14.68528             14.11102         1.217768
V12  0.8254453 21.17199             20.75337         3.298964
...

varImp(rfImp)
            b         g
V3  31.770511 31.770511
V4  19.768070 19.768070
V5  37.919596 37.919596
V6  22.099063 22.099063
V7  22.786656 22.786656
V8  22.153388 22.153388
V9   9.596522  9.596522
V10  9.668101  9.668101
V11  5.363359  5.363359
V12 10.998718 10.998718
...

I think I am missing something...

Edit 2

I figured out that if you take the mean of each row of the first two columns of importance(rfImp), you get the results of varImp(rfImp):

impRF <- importance(rfImp)[,1:2]
apply(impRF, 1, function(x) mean(x))
       V3        V4        V5        V6        V7        V8        V9 
31.770511 19.768070 37.919596 22.099063 22.786656 22.153388  9.596522 
      V10       V11       V12 
 9.668101  5.363359 10.998718     ...

# Same result as in both columns of varImp(rfImp)

I do not know why this happens, but there has to be an explanation for it.

Recommended answer

Let us walk through the method dispatch for varImp.

Inspect the generic:

> getFromNamespace('varImp','caret')
function (object, ...) 
{
    UseMethod("varImp")
}

Get the S3 method for randomForest objects:

> getS3method('varImp','randomForest')
function (object, ...) 
{
    code <- varImpDependencies("rf")
    code$varImp(object, ...)
}
<environment: namespace:caret>


code <- caret:::varImpDependencies('rf')

> code$varImp
function(object, ...){
                    varImp <- randomForest::importance(object, ...)
                    if(object$type == "regression")
                      varImp <- data.frame(Overall = varImp[,"%IncMSE"])
                    else {
                      retainNames <- levels(object$y)
                      if(all(retainNames %in% colnames(varImp))) {
                        varImp <- varImp[, retainNames]
                      } else {
                        varImp <- data.frame(Overall = varImp[,1])
                      }
                    }

                    out <- as.data.frame(varImp)
                    if(dim(out)[2] == 2) {
                      tmp <- apply(out, 1, mean)
                      out[,1] <- out[,2] <- tmp  
                    }
                    out
                  }

So this does not strictly return randomForest::importance.

It starts by calling that function, but for a classification forest it then keeps only the columns named after the class levels of the response (here b and g). The MeanDecreaseAccuracy and MeanDecreaseGini columns are dropped.
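Because MeanDecreaseAccuracy is dropped at this step, ordering the predictors by that column and ordering them by the varImp output need not agree. The following base R sketch checks this on the ten rows printed in the question only (the remaining rows are elided there, so this is not the full ranking). The vector names `mda` and `vi` are introduced here purely for illustration.

```r
# Values copied from the question's ionosphere output (rows V3..V12 only).
vars <- paste0("V", 3:12)

# MeanDecreaseAccuracy column of importance(rfImp)
mda <- c(42.16524, 29.28955, 46.64411, 33.18346, 32.66688,
         33.03005, 19.54399, 20.91442, 14.11102, 20.75337)

# Either column of varImp(rfImp) (both columns are identical)
vi <- c(31.770511, 19.768070, 37.919596, 22.099063, 22.786656,
        22.153388, 9.596522, 9.668101, 5.363359, 10.998718)

names(mda) <- names(vi) <- vars

# Predictor names sorted from most to least important under each measure
order_mda <- names(sort(mda, decreasing = TRUE))
order_vi  <- names(sort(vi,  decreasing = TRUE))

print(rbind(importance = order_mda, varImp = order_vi))
print(identical(order_mda, order_vi))
```

Among these ten predictors the two orderings agree on the top two (V5, V3) but diverge from the third position onward.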

Then it does something interesting: it checks whether exactly two columns remain, and if so it overwrites both with their row means:

if(dim(out)[2] == 2) {
   tmp <- apply(out, 1, mean)
   out[,1] <- out[,2] <- tmp  
}
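As a rough illustration of that branch, the sketch below applies the same two steps (keep the class columns, then average them when there are exactly two) to the first three rows of the question's importance(rfImp) output. The helper `two_class_average` is hypothetical and only mirrors the printed source; it is not part of caret's API.

```r
# Per-class importance values copied from the question (rows V3, V4, V5).
imp <- matrix(c(21.3106205, 42.23040,
                10.9819574, 28.55418,
                30.8473944, 44.99180),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("V3", "V4", "V5"), c("b", "g")))

# Hypothetical helper: mimics the classification branch of the code shown above.
two_class_average <- function(imp, class_levels) {
  # Keep only the columns named after the class levels
  out <- as.data.frame(imp[, class_levels, drop = FALSE])
  # With exactly two classes, replace both columns by their row mean
  if (ncol(out) == 2) {
    tmp <- rowMeans(out)
    out[, 1] <- out[, 2] <- tmp
  }
  out
}

res <- two_class_average(imp, c("b", "g"))
print(res)
```

Up to rounding of the printed inputs, the result matches the varImp(rfImp) rows for V3, V4 and V5 shown in the question, with both columns holding the same value.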


According to the varImp man page:

Random Forest: varImp.randomForest and varImp.RandomForest are wrappers around the importance functions from the randomForest and party packages, respectively.

That is clearly not the whole story here.

Why...

If the response has only two classes, the importance of a variable as a predictor can be represented by a single value.

If the variable is a predictor of g, then it must also be a predictor of b.

That does make sense, but it does not match what the documentation says the function does, so I would likely report it as unexpected behavior. The function is trying to be helpful by doing the relative calculation for you, when you may be expecting to do it yourself.
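If the unaveraged numbers are what you want, one option is to call randomForest::importance directly instead of varImp. The sketch below assumes the randomForest package is installed and uses a two-class subset of the built-in iris data as a stand-in, since the asker's data is not available; the object names are illustrative.

```r
library(randomForest)

set.seed(42)  # forests are random; fix the seed for repeatable numbers

# Assumption: two-class subset of iris stands in for the asker's data.
two_class <- droplevels(subset(iris, Species != "setosa"))

rf <- randomForest(Species ~ ., data = two_class,
                   ntree = 500, importance = TRUE)

# Per-class permutation importance, one column per class level, unaveraged
per_class <- importance(rf)[, levels(two_class$Species)]

# Overall permutation importance (the MeanDecreaseAccuracy column only)
overall <- importance(rf, type = 1)

# Row means of the class columns: what the two-column branch would report
averaged <- rowMeans(per_class)

print(per_class)
print(overall)
print(averaged)
```

Here `type = 1` selects the permutation-based measure; `type = 2` would select the Gini-based one.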
