Different results with randomForest() and caret's randomForest (method = "rf")


Problem Description


I am new to caret, and I just want to ensure that I fully understand what it’s doing. Towards that end, I’ve been attempting to replicate the results I get from a randomForest() model using caret’s train() function for method="rf". Unfortunately, I haven’t been able to get matching results, and I’m wondering what I’m overlooking.

I’ll also add that given that randomForest uses bootstrapping to generate samples to fit each of the ntrees, and estimates error based on out-of-bag predictions, I’m a little fuzzy on the difference between specifying "oob" and "boot" in the trainControl function call. These options generate different results, but neither matches the randomForest() model.

Although I’ve read the caret package website (http://topepo.github.io/caret/index.html), as well as various StackOverflow questions that seem potentially relevant, I haven’t been able to figure out why the caret method = "rf" model produces different results from randomForest(). Thank you very much for any insight you might be able to offer.

Here’s a reproducible example, using the CO2 dataset from the MASS package.

library(MASS)
data(CO2)

library(randomForest)
set.seed(1)
rf.model <- randomForest(uptake ~ .,
                         data = CO2,
                         ntree = 50,
                         nodesize = 5,
                         mtry = 2,
                         importance = TRUE,
                         metric = "RMSE")

library(caret)
set.seed(1)
caret.oob.model <- train(uptake ~ .,
                         data = CO2,
                         method = "rf",
                         ntree = 50,
                         tuneGrid = data.frame(mtry = 2),
                         nodesize = 5,
                         importance = TRUE,
                         metric = "RMSE",
                         trControl = trainControl(method = "oob"),
                         allowParallel = FALSE)

set.seed(1)
caret.boot.model <- train(uptake ~ .,
                          data = CO2,
                          method = "rf",
                          ntree = 50,
                          tuneGrid = data.frame(mtry = 2),
                          nodesize = 5,
                          importance = TRUE,
                          metric = "RMSE",
                          trControl = trainControl(method = "boot", number = 50),
                          allowParallel = FALSE)

print(rf.model)
print(caret.oob.model$finalModel)
print(caret.boot.model$finalModel)

Produces the following:

print(rf.model)

      Mean of squared residuals: 9.380421
                % Var explained: 91.88

print(caret.oob.model$finalModel)

      Mean of squared residuals: 38.3598
                % Var explained: 66.81

print(caret.boot.model$finalModel)

      Mean of squared residuals: 42.56646
                % Var explained: 63.16

And the code to look at variable importance:

importance(rf.model)

importance(caret.oob.model$finalModel)

importance(caret.boot.model$finalModel)

Solution

Using the formula interface in train converts factors to dummy variables. To compare results from caret with randomForest, you should use the non-formula interface.
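To see concretely what the formula interface changes, here is a minimal sketch (CO2 ships with base R's datasets package; Plant, Type and Treatment are its factor columns):

```r
data(CO2)

# Non-formula interface: the 4 predictors are passed as-is,
# factors included, so randomForest splits on the factors directly.
X.raw <- CO2[, -5]          # drop uptake (column 5)
ncol(X.raw)                 # 4 predictors

# Formula interface: train() builds a model matrix first, expanding
# every factor into dummy/contrast columns before fitting.
X.formula <- model.matrix(uptake ~ ., data = CO2)[, -1]  # drop intercept
ncol(X.formula)             # many more columns than 4
```

With the expanded matrix, each tree is choosing among a different (larger) set of candidate split variables, so mtry = 2 no longer means the same thing as it does with the 4 raw predictors.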

In your case, you should provide seeds inside trainControl to get the same result as with randomForest.

In the training section of the caret webpage there are some notes on reproducibility that explain how to use seeds.

library("randomForest")
set.seed(1)
rf.model <- randomForest(uptake ~ ., 
                         data = CO2,
                         ntree = 50,
                         nodesize = 5,
                         mtry = 2,
                         importance = TRUE, 
                         metric = "RMSE")

library("caret")
caret.oob.model <- train(CO2[, -5], CO2$uptake, 
                         method = "rf",
                         ntree = 50,
                         tuneGrid = data.frame(mtry = 2),
                         nodesize = 5,
                         importance = TRUE, 
                         metric = "RMSE",
                         trControl = trainControl(method = "oob", seeds = 1),
                         allowParallel = FALSE)

If you are doing resampling, you should provide seeds for each resampling iteration and an additional one for the final model. Examples in ?trainControl show how to create them.

In the following example, the last seed is for the final model and I set it to 1.

# 26 elements: one per bootstrap resample (default number = 25) + 1 extra
seeds <- as.vector(c(1:26), mode = "list")

# For the final model
seeds[[26]] <- 1

caret.boot.model <- train(CO2[, -5], CO2$uptake, 
                          method = "rf",
                          ntree = 50,
                          tuneGrid = data.frame(mtry = 2),
                          nodesize = 5,
                          importance = TRUE, 
                          metric = "RMSE",
                          trControl = trainControl(method = "boot", seeds = seeds),
                          allowParallel = FALSE)
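More generally (following the pattern described in ?trainControl), the seeds list wants one integer vector per resampling iteration, each as long as the tuning grid has rows, plus a final single integer for the last model. A hypothetical construction for 25 bootstrap resamples and a 3-row mtry grid might look like:

```r
n.resamples <- 25   # matches trainControl's default number for "boot"
n.tune      <- 3    # hypothetical tuning grid with 3 mtry candidates

set.seed(123)
seeds <- vector(mode = "list", length = n.resamples + 1)
for (i in seq_len(n.resamples)) {
  seeds[[i]] <- sample.int(10000, n.tune)        # one seed per tuning candidate
}
seeds[[n.resamples + 1]] <- sample.int(10000, 1)  # seed for the final model
```

In the answer's example the tuning grid has a single row (mtry = 2), which is why a list of plain integers is enough there.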

Defining the non-formula interface correctly with caret, and setting the seeds in trainControl, you will get the same results in all three models:

rf.model
caret.oob.model$finalModel
caret.boot.model$finalModel
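Assuming the three fits above, one quick way to confirm they really are the same forest is to compare the stored out-of-bag error curves and predictions (finalModel is the underlying randomForest object that caret stores; mse and OOB predictions are standard randomForest fields):

```r
# rf.model, caret.oob.model and caret.boot.model are assumed to be the
# fits from the code above.
all.equal(rf.model$mse, caret.oob.model$finalModel$mse)
all.equal(rf.model$mse, caret.boot.model$finalModel$mse)

# predict() with no newdata returns out-of-bag predictions;
# if the seeds line up as described, these should also agree.
all.equal(unname(predict(rf.model)),
          unname(predict(caret.boot.model$finalModel)))
```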
