What does a negative %IncMSE in the RandomForest package mean?

Problem description

I used RandomForest for a regression problem. I used importance(rf,type=1) to get the %IncMSE for the variables, and one of them has a negative %IncMSE. Does this mean that this variable is bad for the model? I searched the Internet for answers but did not find a clear one. I also found something strange in the model's summary (attached below): it seems that only one tree was used, although I set ntree to 800.

Model:

rf<-randomForest(var1~va2+var3+..+var35,data=d7depo,ntree=800,keep.forest=FALSE, importance=TRUE)

summary(rf)
                Length Class  Mode     
call                6  -none- call     
type                1  -none- character
predicted       26917  -none- numeric  
mse               800  -none- numeric  
rsq               800  -none- numeric  
oob.times       26917  -none- numeric  
importance         70  -none- numeric  
importanceSD       35  -none- numeric  
localImportance     0  -none- NULL     
proximity           0  -none- NULL     
ntree               1  -none- numeric  
mtry                1  -none- numeric  
forest              0  -none- NULL     
coefs               0  -none- NULL     
y               26917  -none- numeric  
test                0  -none- NULL     
inbag               0  -none- NULL     
terms               3  terms  call 

Recommended answer

Question 1 - why does ntree show 1?

summary(rf) shows you the length of each object stored in your rf variable, not its value. A length of 1 for ntree means that rf$ntree is a single number. If you type rf$ntree in your console you will see that it shows 800.
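To make the length-versus-value distinction concrete, here is a minimal sketch in base R. The list `rf_like` is a hypothetical stand-in for the fitted object, not something produced by the package; it only mimics two of its components.

```r
# Hypothetical stand-in for a fitted randomForest object: a plain list
# with a scalar ntree and one error value per tree.
rf_like <- list(ntree = 800, mse = numeric(800))

summary(rf_like)        # the Length column reports 1 for ntree and 800 for mse
length(rf_like$ntree)   # 1: ntree is stored as a single number
rf_like$ntree           # 800: the value held by that single number
```

The same reading applies to the summary in the question: the mse and rsq rows have length 800 because they hold one value per tree.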

Question 2 - does a negative %IncMSE indicate a "bad" variable?

IncMSE:
This is calculated by first computing the MSE of the whole model. Let's call this MSEmod. Then, for each variable (column in your data set), the values are randomly shuffled (permuted) so that a "bad" variable is created, and a new MSE is calculated. For example, imagine that one column has the values 1,2,3,4,5 in its rows. After the permutation these might end up as 4,3,1,2,5. With all the other columns left exactly as they were (since we want to examine col1's importance), the MSE of the model is calculated again; let's call it MSEcol1 (in the same way you will have MSEcol2, MSEcol3, but let's keep it simple and deal only with MSEcol1 here). Since the second MSE was computed with a variable whose values are completely random, we would expect MSEcol1 to be higher than MSEmod (the higher the MSE, the worse). Therefore, when we take the difference MSEcol1 - MSEmod, we usually expect a positive number. In your case, a negative number shows that the shuffled variable worked better than the original one, which suggests that the variable is probably not predictive enough, i.e. not important.

Keep in mind that this description is high level. In reality the package measures both errors on the out-of-bag data of each tree, averages the differences over all trees, and by default scales the result by the standard deviation of those differences. But the high-level story is this.

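In algorithmic form:

  1. Compute the model MSE
  2. For each variable in the model:
    • Permute the variable
    • Compute the new MSE of the model with the permuted variable
    • Take the difference between the new MSE and the model MSE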
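The steps above can be sketched in a few lines of base R. This is an illustration of the high-level idea only, under stated assumptions: it uses a linear model and a held-out test set instead of a random forest and out-of-bag data, and the data, the column names `x1` and `noise`, and the helper functions `mse` and `perm_increase` are all made up for the example.

```r
set.seed(42)

# Simulated data (an assumption for illustration): x1 drives y, noise does not.
n <- 200
d <- data.frame(x1 = rnorm(n), noise = rnorm(n))
d$y <- 3 * d$x1 + rnorm(n)

# Fit on one half and measure error on the other half.
train <- d[1:100, ]
test  <- d[101:200, ]
fit   <- lm(y ~ x1 + noise, data = train)

# Mean squared error of a fitted model on a data frame.
mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)

# Step 1: model MSE on the untouched test data.
mse_mod <- mse(fit, test)

# Step 2: permute one column, recompute the MSE, take the difference.
perm_increase <- function(var) {
  shuffled <- test
  shuffled[[var]] <- sample(shuffled[[var]])
  mse(fit, shuffled) - mse_mod
}

inc_mse <- sapply(c("x1", "noise"), perm_increase)
print(inc_mse)
```

Shuffling `x1` should raise the error clearly, while shuffling `noise` changes it only slightly, and that small change can land on either side of zero by chance. A negative value for an uninformative variable is therefore nothing unusual.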

Hope it is clear now!
