Error when using predict() on a randomForest object trained with caret's train() using formula

Problem description

Using R 3.2.0 with caret 6.0-41 and randomForest 4.6-10 on a 64-bit Linux machine.

When I call the predict() method on a randomForest object that was trained with the train() function from the caret package using a formula, the function returns an error. When the model is trained directly with randomForest(), and/or with x= and y= rather than a formula, everything runs smoothly.

Here is a working example:

library(randomForest)
library(caret)

data(imports85)
imp85     <- imports85[, c("stroke", "price", "fuelType", "numOfDoors")]
imp85     <- imp85[complete.cases(imp85), ]
imp85[]   <- lapply(imp85, function(x) if (is.factor(x)) x[,drop=TRUE] else x) ## Drop empty levels for factors.

modRf1  <- randomForest(numOfDoors~., data=imp85)
caretRf <- train( numOfDoors~., data=imp85, method = "rf" )
modRf2  <- caretRf$finalModel
modRf3  <- randomForest(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"])
caretRf <- train(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"], method = "rf")
modRf4  <- caretRf$finalModel

p1      <- predict(modRf1, newdata=imp85)
p2      <- predict(modRf2, newdata=imp85)
p3      <- predict(modRf3, newdata=imp85)
p4      <- predict(modRf4, newdata=imp85)

Among the last 4 lines, only the second one, p2 <- predict(modRf2, newdata=imp85), returns the following error:

Error in predict.randomForest(modRf2, newdata = imp85) : 
variables in the training data missing in newdata

The reason for this error seems to be that the predict.randomForest method uses rownames(object$importance) to determine the names of the variables used to train the random forest object. Looking at

rownames(modRf1$importance)
rownames(modRf2$importance)
rownames(modRf3$importance)
rownames(modRf4$importance)

we see:

[1] "stroke"   "price"    "fuelType"
[1] "stroke"   "price"    "fuelTypegas"
[1] "stroke"   "price"    "fuelType"
[1] "stroke"   "price"    "fuelType"

So somehow, using the caret train() function with a formula changes the names of the (factor) variables in the importance field of the randomForest object.

Is this really an inconsistency between the formula and non-formula versions of the caret train() function? Or am I missing something?

Recommended answer

First, almost never use the $finalModel object for prediction. Use predict.train. This is one good example of why.
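As a minimal sketch of that advice, the snippet below rebuilds the question's data and calls predict() on the object returned by train() rather than on $finalModel. It assumes the same packages as the question (caret and randomForest); the droplevels() call and the set.seed() line are substitutions made here for brevity and reproducibility, not part of the original code.

```r
library(randomForest)  # provides the imports85 data and the "rf" backend
library(caret)

data(imports85)
imp85 <- imports85[, c("stroke", "price", "fuelType", "numOfDoors")]
# Assumption: droplevels() stands in for the question's lapply() level-dropping step.
imp85 <- droplevels(imp85[complete.cases(imp85), ])

set.seed(1)  # arbitrary seed, only so repeated runs resample the same way
caretRf <- train(numOfDoors ~ ., data = imp85, method = "rf")

# predict() on the train object dispatches to predict.train, which prepares
# newdata the same way the formula interface prepared the training data.
pOk <- predict(caretRf, newdata = imp85)
```

The predicted classes depend on the random forest fit, so no output is shown here.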

There is some inconsistency in how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method convert factor predictors to dummy variables because their models require a numerical representation of the data. The exceptions are tree- and rule-based models (which can split on categorical predictors), naive Bayes, and a few others.

So randomForest will not create dummy variables when you call randomForest(y ~ ., data = dat), but train (and most other functions) will when called as train(y ~ ., data = dat).

The error occurs because fuelType is a factor. The dummy variables created by train don't have the same names as the original columns (fuelTypegas rather than fuelType), so predict.randomForest can't find them in newdata.
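The renaming can be reproduced in base R without fitting any model. The sketch below uses model.matrix() to illustrate the kind of dummy-variable expansion a formula interface typically performs; it is an illustration only, not a claim about caret's exact internals. The toy data frame and its values are made up, with column names borrowed from the question.

```r
# Toy data, invented for illustration; column names mirror the question.
toy <- data.frame(
  stroke   = c(2.68, 3.47, 3.40, 3.19),
  price    = c(13495, 16500, 13950, 7099),
  fuelType = factor(c("gas", "diesel", "gas", "diesel"))
)

# Expand the predictors into a numeric design matrix.
mm <- model.matrix(~ ., data = toy)
dummy_names <- setdiff(colnames(mm), "(Intercept)")
print(dummy_names)
# "stroke" "price" "fuelTypegas"

# Names the expanded matrix has that the original data frame lacks:
missing_in_newdata <- setdiff(dummy_names, names(toy))
print(missing_in_newdata)
# "fuelTypegas"
```

A model trained on the expanded columns records fuelTypegas as a predictor name, which is why a lookup against the original data frame fails.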

Using the non-formula method with train passes the factor predictors to randomForest unchanged, and everything works.

TL;DR

Use the non-formula method with train if you want the same levels, or use predict.train.

Max
