R random forest inconsistent predictions

Problem description

I recently built a random forest model using the ranger package in R. However, I noticed that the predictions stored in the ranger object during training (accessible with model$predictions) do not match the predictions I get when I run predict() on the same dataset with the fitted model. The following code reproduces the problem on the mtcars dataset. I created a binary variable only to turn this into a classification problem, though I saw similar results with regression trees as well.

library(datasets)
library(ranger)
mtcars <- mtcars
mtcars$mpg2 <- ifelse(mtcars$mpg > 19.2 , 1, 0)
mtcars <- mtcars[,-1]
mtcars$mpg2 <- as.factor(mtcars$mpg2)
set.seed(123)
mod <- ranger(mpg2 ~ ., mtcars, num.trees = 20, probability = TRUE)
mod$predictions[1,] # Probability of 1 = 0.905
predict(mod, mtcars[1,])$predictions # Probability of 1 = 0.967

The same problem appears with the randomForest package, where it is reproducible with the following code.

library(randomForest)
set.seed(123)
mod <- randomForest(mpg2 ~ ., mtcars, ntree = 20)
mod$votes[1,]
predict(mod, mtcars[1,], type = "prob")

Can someone please tell me why this is happening? I would expect the results to be the same. Am I doing something wrong, or am I misunderstanding some inherent property of random forests that leads to this behavior?

Recommended answer

I think you may want to look a little more deeply into how a random forest works. I really recommend An Introduction to Statistical Learning with Applications in R (ISLR), which is available for free online here.

That said, I believe the main issue is that you are treating mod$votes and the output of predict() as the same thing, when they are not. According to the documentation of the randomForest function, mod$votes and mod$predicted are out-of-bag ("OOB") predictions for the input data: each observation is scored only by the trees whose bootstrap sample did not include it. predict() with new data, by contrast, passes an observation through every tree in the model produced by randomForest(), including trees that saw that observation during training. The same applies to ranger, whose documentation describes the predictions element as based on out-of-bag samples. Typically, you would train the model on one set of data and use predict() on a separate test set.
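The difference between the two kinds of prediction can be checked directly. The sketch below is only an illustration, not part of the original answer: it assumes the randomForest package is installed, uses a copy of the data named cars (a name chosen here), and grows 200 trees rather than 20 so that every row is very likely to be out of bag for at least one tree. It relies on predict() returning the stored OOB results when no newdata argument is given.

```r
library(randomForest)

# Rebuild the binary target from the question on a copy of mtcars.
cars <- datasets::mtcars
cars$mpg2 <- as.factor(ifelse(cars$mpg > 19.2, 1, 0))
cars$mpg <- NULL

set.seed(123)
mod <- randomForest(mpg2 ~ ., data = cars, ntree = 200)

# No newdata: predict() returns the stored out-of-bag results.
oob_class <- predict(mod)
oob_prob  <- predict(mod, type = "prob")

# With newdata: every tree votes, including trees trained on that row.
insample_prob <- predict(mod, newdata = cars, type = "prob")

# The OOB results should agree with what the model object stores.
print(all(as.character(oob_class) == as.character(mod$predicted)))
print(isTRUE(all.equal(as.numeric(oob_prob), as.numeric(mod$votes))))

# Row 1 (Mazda RX4): OOB versus in-sample class probabilities.
print(rbind(oob = oob_prob[1, ], insample = insample_prob[1, ]))
```

The in-sample probabilities are usually more confident than the OOB ones, because most trees have already seen the row they are asked to score.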

Finally, you may need to re-run set.seed() every time you fit the random forest if you want the mod object to be identical across runs. I think there is a way to set the seed for an entire session, but I am not sure. This looks like a useful post: Fixing set.seed for an entire session
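As a rough illustration of the seeding point, the sketch below refits the same forest twice with the same seed and compares the stored OOB votes. It assumes the randomForest package is installed; cars and fit_forest() are hypothetical names introduced here, not part of the original question or answer.

```r
library(randomForest)

# Rebuild the binary target from the question on a copy of mtcars.
cars <- datasets::mtcars
cars$mpg2 <- as.factor(ifelse(cars$mpg > 19.2, 1, 0))
cars$mpg <- NULL

# Hypothetical helper: reseed immediately before each fit.
fit_forest <- function(seed) {
  set.seed(seed)
  randomForest(mpg2 ~ ., data = cars, ntree = 20)
}

mod_a <- fit_forest(123)
mod_b <- fit_forest(123)

# Same seed, same data, same settings: the OOB votes should be identical.
print(identical(mod_a$votes, mod_b$votes))

# A fit without reseeding continues from the current RNG state,
# so its votes will generally differ from mod_a's.
mod_c <- randomForest(mpg2 ~ ., data = cars, ntree = 20)
print(identical(mod_a$votes, mod_c$votes))
```

Reseeding makes repeated fits reproducible, but it does not make the OOB votes match the output of predict() on the training data; those remain two different quantities.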

Side note: you are not specifying the number of variables tried at each split (mtry in randomForest and ranger), but the default is good enough in most cases (check the documentation of each random forest function you are using for its default). Maybe you set it in your actual code and left it out of the example, but I thought it was worth mentioning.
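To see which value was actually used, the fitted randomForest object stores it in its mtry element. The sketch below is a minimal check, assuming the randomForest package is installed and using cars as a name chosen here for a copy of the data. For classification, randomForest documents the default as the floor of the square root of the number of predictors; the value 5 in the second fit is arbitrary and only for illustration.

```r
library(randomForest)

# Rebuild the binary target from the question on a copy of mtcars.
cars <- datasets::mtcars
cars$mpg2 <- as.factor(ifelse(cars$mpg > 19.2, 1, 0))
cars$mpg <- NULL

n_predictors <- ncol(cars) - 1  # every column except the target mpg2

# Default mtry for a classification forest.
set.seed(123)
mod_default <- randomForest(mpg2 ~ ., data = cars, ntree = 200)
print(mod_default$mtry)
print(floor(sqrt(n_predictors)))

# Explicit mtry (5 is an arbitrary illustrative choice).
set.seed(123)
mod_mtry5 <- randomForest(mpg2 ~ ., data = cars, ntree = 200, mtry = 5)
print(mod_mtry5$mtry)
```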

Hope this helps!

I tried training the random forest on all of the data except the first observation (Mazda RX4) and then used predict() on just that observation, which I think illustrates my point a bit better. Try running something like this:

library(randomForest)
set.seed(123)
mod <- randomForest(mpg2 ~ ., mtcars[-1,], ntree = 200)
predict(mod, mtcars[1,], type = "prob")
