What does negative %IncMSE in RandomForest package mean?
Question
I used randomForest for a regression problem. I called importance(rf, type=1) to get the %IncMSE of each variable, and one of them has a negative %IncMSE. Does this mean that the variable is bad for the model? I searched the Internet but did not find a clear answer.
I also found something strange in the model's summary (attached below): it seems that only one tree was used, although I set ntree to 800.
Model:
rf <- randomForest(var1 ~ var2 + var3 + .. + var35, data = d7depo, ntree = 800, keep.forest = FALSE, importance = TRUE)
summary(rf)
                Length Class  Mode
call                6  -none- call
type                1  -none- character
predicted       26917  -none- numeric
mse               800  -none- numeric
rsq               800  -none- numeric
oob.times       26917  -none- numeric
importance         70  -none- numeric
importanceSD       35  -none- numeric
localImportance     0  -none- NULL
proximity           0  -none- NULL
ntree               1  -none- numeric
mtry                1  -none- numeric
forest              0  -none- NULL
coefs               0  -none- NULL
y               26917  -none- numeric
test                0  -none- NULL
inbag               0  -none- NULL
terms               3  terms  call
Recommended answer
Question 1 - why does ntree show 1?
summary(rf) shows the length of each object stored in your rf variable, not its value. The 1 next to ntree therefore means that rf$ntree has length 1, not that only one tree was grown. If you type rf$ntree at the console you will see that it shows 800.
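As a small illustration of why the summary shows 1, here is a sketch in base R. It does not fit a model: fake_rf is a hypothetical stand-in, a plain list with two fields shaped like those of the fitted object. For a list-like object, summary() reports the length of each field rather than its value.

```r
# Hypothetical stand-in for the fitted object (no model is fitted here).
fake_rf <- list(
  ntree = 800,          # a single number, so its length is 1
  mse   = numeric(800)  # one entry per tree, so its length is 800
)

print(summary(fake_rf))       # Length / Class / Mode table, as in the question
print(length(fake_rf$ntree))  # the length of the field
print(fake_rf$ntree)          # the value of the field
```

In the question's own summary, mse and rsq both have length 800, which is consistent with 800 trees having been grown.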
Question 2 - does a negative %IncMSE indicate a "bad" variable?
%IncMSE:
This is calculated by first computing the MSE of the whole model. Let's call this MSEmod. Then, for each variable (column in your data set), the values are randomly shuffled (permuted) so that a "bad" variable is created, and a new MSE is calculated. For example, imagine that one column holds the values 1, 2, 3, 4, 5 in rows 1 to 5; after the permutation they might end up as 4, 3, 1, 2, 5. All of the other columns remain exactly the same, since we want to examine col1's importance. The new MSE of the model is then calculated; let's call it MSEcol1 (in the same way you will have MSEcol2, MSEcol3 and so on, but let's keep it simple and deal only with MSEcol1 here).
Since the second MSE was computed with a completely randomized variable, we expect MSEcol1 to be higher than MSEmod (the higher the MSE, the worse the model). Therefore, when we take the difference MSEcol1 - MSEmod, we usually expect a positive number. In your case, a negative number shows that the shuffled variable worked better than the real one, which suggests that the variable is probably not predictive enough, i.e. not important.
Keep in mind that this description is high level. In the package, the two MSE values are computed for each tree on its out-of-bag data, the differences are averaged over all trees, and by default the average is normalized by the standard deviation of the differences, so the reported figure is a scaled difference rather than a raw one. But the high-level story is as above.
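To see the scaled and unscaled versions side by side, a minimal sketch along these lines may help. It assumes the randomForest package is installed and uses the built-in mtcars data purely as a stand-in for the data in the question; rf_demo, scaled and unscaled are illustrative names.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(42)
# Stand-in regression problem: predict mpg from the other mtcars columns.
rf_demo <- randomForest(mpg ~ ., data = mtcars, ntree = 800, importance = TRUE)

print(rf_demo$ntree)  # the number of trees stored in the fitted object

# type = 1 selects the permutation-based measure (%IncMSE for regression).
scaled   <- importance(rf_demo, type = 1)                 # default, scale = TRUE
unscaled <- importance(rf_demo, type = 1, scale = FALSE)  # raw mean increase in MSE

print(data.frame(scaled = scaled[, 1], unscaled = unscaled[, 1]))
```

The ranking of the variables can differ slightly between the two columns, so it is worth stating which one is being reported.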
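In algorithmic form:
- Compute the model MSE
- For each variable in the model:
  - Permute the variable
  - Compute the new model MSE with the permuted variable
  - Take the difference between the new model MSE and the model MSE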
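The steps above can be sketched in base R. This is a simplified illustration under stated assumptions, not what the package does internally: it uses lm as a stand-in model instead of a random forest, a held-out test set instead of out-of-bag samples, and simulated data in which x1 is informative and noise is not. All names (d, fit, mse, perm_increase) are hypothetical.

```r
set.seed(1)

# Simulated data: y depends on x1 only; 'noise' is unrelated to y.
n <- 500
d <- data.frame(x1 = rnorm(n), noise = rnorm(n))
d$y <- 3 * d$x1 + rnorm(n)

train <- d[1:250, ]
test  <- d[251:500, ]

# Stand-in model (a linear model, NOT a random forest).
fit <- lm(y ~ x1 + noise, data = train)

# Mean squared error of a model on a data set.
mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)

# Step 1: MSE of the model on untouched data (MSEmod in the text).
mse_mod <- mse(fit, test)

# Steps 2-4: shuffle one column, recompute the MSE, take the difference.
perm_increase <- function(var) {
  shuffled <- test
  shuffled[[var]] <- sample(shuffled[[var]])  # permute only this column
  mse(fit, shuffled) - mse_mod                # MSEcol - MSEmod
}

inc <- sapply(c("x1", "noise"), perm_increase)
print(inc)
```

Shuffling x1 should raise the MSE substantially, while shuffling noise should change it very little; a difference that close to zero can land on either side of it, which is how a negative value arises for an unimportant variable.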
Hope it is clear now!