R: variable exclusion from formula not working in presence of missing data
Question
I'm building a model in R while excluding the 'office' column in the formula (it sometimes contains hints of the class I predict). I'm training on 'train' and predicting on 'test':
> model <- randomForest::randomForest(tc ~ . - office, data=train, importance=TRUE,proximity=TRUE )
> prediction <- predict(model, test, type = "class")
The predictions all come back as NA:
> head(prediction)
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: 2668 2752 2921 3005
The reason is that test$office contains NAs:
> head(test$office)
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: 2668 2752 2921 3005
I can fix the problem by removing the NAs (here, by overwriting the column with a constant):
> test2 <- test
> test2$office <- 1
> prediction <- predict(model, test2, type = "class")
> head(prediction)
3 5 10 12 14 18
2921 2752 2921 2752 2921 2752
Levels: 2668 2752 2921 3005
I can avoid the problem by explicitly removing the 'office' column from the training data, rather than from the formula:
> model <- randomForest::randomForest(tc ~ ., data=train[,!(names(train) %in% c('office'))], importance=TRUE,proximity=TRUE )
> prediction <- predict(model, test, type = "class")
> head(prediction)
3 5 10 12 14 18
3005 2752 3005 2752 2921 2752
Levels: 2668 2752 2921 3005
>
My question: what is the reason for this behavior?
Wasn't the formula tc ~ . - office meant to exclude 'office' from the model?
Is there an elegant solution here?
User agenis asked for the result of str(test); I masked some of the field names:
str(test)
'data.frame': 792 obs. of 15 variables:
$ XXX : Factor w/ 2 levels "Force","Retry": 1 2 2 1 2 2 1 1 1 1 ...
$ XXX : Factor w/ 15 levels "25 Westend, Birmingham",..: 6 13 6 15 13 15 10 3 5 12 ...
$ XXX : Factor w/ 3 levels "Instructions Info 1",..: 2 2 3 2 2 2 2 3 3 3 ...
$ XXX : Factor w/ 3 levels "Remittance Info 1",..: 3 1 3 1 2 2 1 1 1 1 ...
$ XXX : Factor w/ 3 levels "CRED","DEBT",..: 3 2 1 2 1 2 1 2 2 3 ...
$ XXX : Factor w/ 3 levels "INTC","LOAN",..: 2 2 2 3 1 3 1 1 3 3 ...
$ XXX : Factor w/ 15 levels "25 Westend, Birmingham",..: 3 9 15 14 5 15 10 11 2 7 ...
$ XXX : Factor w/ 2 levels "SDVA","URGP": 1 2 1 1 1 2 2 2 2 1 ...
$ XXX : Factor w/ 3 levels "CNY","EUR","GBP": 1 2 1 1 2 1 2 1 2 3 ...
$ XXX : Factor w/ 19 levels "BNKADE22XXX",..: 3 19 11 11 4 8 8 8 19 3 ...
$ XXX : Factor w/ 4 levels "_NV_E_","CNY",..: 1 3 2 2 3 2 3 2 3 1 ...
$ XXX : Factor w/ 9 levels "BNKADE22XXX",..: 3 9 1 1 4 8 8 8 9 3 ...
$ tc : Factor w/ 4 levels "604","688","698",..: NA NA NA NA NA NA NA NA NA NA ...
$ office : Factor w/ 4 levels "604","688","698",..: NA NA NA NA NA NA NA NA NA NA ...
Shay
Recommended answer
For some reason, the randomForest function first checks the whole data frame for missing values before looking at what's in your formula.
It returns an error if any column contains NAs, including columns the formula excludes:
Error in na.fail.default(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1,  : missing values in object
If there are no missing observations, the formula you specified is correct and will not use the column excluded with the minus sign.
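The check can be reproduced with base R alone. The sketch below assumes that randomForest's formula interface builds its data through model.frame, as most R modelling functions do; that assumption is not confirmed in the answer. It shows that subtracting a variable removes it from the model terms, while the model frame (where the NA check runs) still contains it:

```r
df <- mtcars
df[2:10, "am"] <- NA  # missing values only in the column we exclude

# The term labels are the predictors the model would actually use.
term_labels <- attr(terms(mpg ~ . - am, data = df), "term.labels")
"am" %in% term_labels

# The model frame still evaluates every variable named in the formula,
# including the subtracted one.
mf <- model.frame(mpg ~ . - am, data = df, na.action = na.pass)
"am" %in% names(mf)

# With na.fail (the default reported in the error above) the NAs in 'am'
# therefore stop the call, even though 'am' is not a predictor.
err <- tryCatch(
  model.frame(mpg ~ . - am, data = df, na.action = na.fail),
  error = function(e) conditionMessage(e)
)
err
```

If that assumption holds, the same mechanism would explain why rows with NA in the excluded column are affected at prediction time as well.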
There are then two possibilities:
- Specify the argument na.action=na.pass to bypass the initial NA check; the algorithm will then run without error. This argument literally means "take no action" and lets you see what happens if you keep the NAs. It works here because the NAs sit only in the excluded column. It differs from na.exclude, which would remove entire rows (which you don't want, because the other variables in those rows are non-missing).
- Manually pre-process the data to remove either the rows with missing values or the entire column.
Code example:
df <- mtcars
df[2:10, 'am'] <- NA  # put missing values in the column to be excluded
fit <- randomForest::randomForest(mpg ~ . - am, data = df, na.action = na.pass)
fit$importance  # check that the 'am' variable is absent:
#### IncNodePurity
#### cyl 169.05853
#### disp 267.94975
#### hp 167.03634
#### drat 66.45550
#### wt 276.21383
#### qsec 25.33688
#### vs 30.48513
#### gear 15.39151
#### carb 24.60022
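The second option, manual pre-processing, might look like the sketch below. It reuses the toy mtcars data from the example above; the name df_clean is only illustrative, and the randomForest call is guarded so the data-preparation part runs even when the package is not installed.

```r
df <- mtcars
df[2:10, "am"] <- NA  # same toy data as above

# Drop the unwanted column up front instead of subtracting it in the formula.
df_clean <- df[, setdiff(names(df), "am"), drop = FALSE]

# No NAs remain, so the default NA check has nothing to object to.
anyNA(df_clean)

# Fit and predict only if the randomForest package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(mpg ~ ., data = df_clean)
  # 'am' was never part of the model, so its NAs in newdata should not matter,
  # as in the question's second attempt.
  pred <- predict(fit, newdata = df)
  anyNA(pred)
}
```

This mirrors what the question already found to work: because the model never sees the column, neither fitting nor prediction depends on its missing values.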