Error when using predict() on a randomForest object trained with caret's train() using formula
Question
Using R 3.2.0 with caret 6.0-41 and randomForest 4.6-10 on a 64-bit Linux machine.
When trying to use the predict() method on a randomForest object trained with the train() function from the caret package using a formula, the function returns an error.
When training via randomForest() and/or using x= and y= rather than a formula, it all runs smoothly.
Here is a working example:
library(randomForest)
library(caret)
data(imports85)
imp85 <- imports85[, c("stroke", "price", "fuelType", "numOfDoors")]
imp85 <- imp85[complete.cases(imp85), ]
imp85[] <- lapply(imp85, function(x) if (is.factor(x)) x[,drop=TRUE] else x) ## Drop empty levels for factors.
modRf1 <- randomForest(numOfDoors~., data=imp85)
caretRf <- train( numOfDoors~., data=imp85, method = "rf" )
modRf2 <- caretRf$finalModel
modRf3 <- randomForest(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"])
caretRf <- train(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"], method = "rf")
modRf4 <- caretRf$finalModel
p1 <- predict(modRf1, newdata=imp85)
p2 <- predict(modRf2, newdata=imp85)
p3 <- predict(modRf3, newdata=imp85)
p4 <- predict(modRf4, newdata=imp85)
Among the last 4 lines, only the second one, p2 <- predict(modRf2, newdata=imp85), returns the following error:
Error in predict.randomForest(modRf2, newdata = imp85) :
variables in the training data missing in newdata
It seems that the reason for this error is that the predict.randomForest method uses rownames(object$importance) to determine the names of the variables used to train the random forest object. And when looking at
rownames(modRf1$importance)
rownames(modRf2$importance)
rownames(modRf3$importance)
rownames(modRf4$importance)
we see:
[1] "stroke" "price" "fuelType"
[1] "stroke" "price" "fuelTypegas"
[1] "stroke" "price" "fuelType"
[1] "stroke" "price" "fuelType"
So somehow, using the caret train() function with a formula changes the names of the (factor) variables in the importance field of the randomForest object.
Is it really an inconsistency between the formula and non-formula versions of the caret train() function? Or am I missing something?
Recommended answer
First, almost never use the $finalModel object for prediction. Use predict.train. This is one good example of why.
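A minimal sketch of that recommendation, reusing the data preparation from the question. Calling predict() on the object returned by train() dispatches to predict.train, which applies the same formula preprocessing to newdata before handing it to the forest. The 5-fold cross-validation setting, the seed, and the name pTrain are my own choices to keep the run short and reproducible; they are not part of the original post.

```r
library(randomForest)
library(caret)

data(imports85)
imp85 <- imports85[, c("stroke", "price", "fuelType", "numOfDoors")]
imp85 <- droplevels(imp85[complete.cases(imp85), ])  # drop NA rows and empty factor levels

set.seed(1)  # assumption: fixed seed only for reproducibility
caretRf <- train(numOfDoors ~ ., data = imp85, method = "rf",
                 trControl = trainControl(method = "cv", number = 5))

# Predict with the train object itself, not with caretRf$finalModel
pTrain <- predict(caretRf, newdata = imp85)
head(pTrain)
```

Because the train object remembers how the formula expanded the predictors, newdata can stay in its original form with fuelType as a factor.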
There is some inconsistency in how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method will convert factor predictors to dummy variables because their models require numerical representations of the data. The exceptions to this are tree- and rule-based models (which can split on categorical predictors), naive Bayes, and a few others.
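The dummy-variable expansion can be seen with base R's model.matrix(), which is what most formula interfaces rely on. This is only an illustration on a made-up three-row data frame (toy is a hypothetical name, and the values are invented), not a trace of caret's internals.

```r
# Toy data: one numeric predictor and one two-level factor (values are made up)
toy <- data.frame(
  stroke   = c(2.7, 3.4, 3.1),
  fuelType = factor(c("gas", "diesel", "gas"))
)

# Expand the predictors the way a typical formula method would
mm <- model.matrix(~ ., data = toy)
colnames(mm)
# [1] "(Intercept)" "stroke"      "fuelTypegas"
```

The factor fuelType is replaced by a single 0/1 column named fuelTypegas (the first level, diesel, is the reference), which matches the name seen in rownames(modRf2$importance) in the question.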
So randomForest will not create dummy variables when you use randomForest(y ~ ., data = dat), but train (and most others) will when using a call like train(y ~ ., data = dat).
The error occurs because fuelType is a factor. The dummy variables created by train don't have the same names as the original columns, so predict.randomForest can't find them.
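One way to check this mismatch directly is to compare the variable names stored in the fitted forest with the column names of the data passed as newdata. This is a diagnostic sketch only, again reusing the question's data; the names expected and missing, the seed, and the cross-validation setting are my own additions.

```r
library(randomForest)
library(caret)

data(imports85)
imp85 <- imports85[, c("stroke", "price", "fuelType", "numOfDoors")]
imp85 <- droplevels(imp85[complete.cases(imp85), ])

set.seed(1)  # assumption: fixed seed only for reproducibility
caretRf <- train(numOfDoors ~ ., data = imp85, method = "rf",
                 trControl = trainControl(method = "cv", number = 5))

# Names the underlying randomForest was trained on
expected <- rownames(caretRf$finalModel$importance)

# Names it expects but that do not exist as columns of imp85
missing <- setdiff(expected, names(imp85))
missing
```

With the output shown in the question, missing should contain "fuelTypegas", the dummy column that the raw data frame does not have.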
Using the non-formula method with train will pass the factor predictors to randomForest unchanged, and everything will work.
TL;DR
Use the non-formula method with train if you want the same levels, or use predict.train.
Max