R: variable exclusion from formula not working in presence of missing data

Problem description

I'm building a model in R and excluding the 'office' column in the formula (it sometimes contains hints of the class I am predicting). I'm training on 'train' and predicting on 'test':

> model <- randomForest::randomForest(tc ~ . - office, data = train, importance = TRUE, proximity = TRUE)
> prediction <- predict(model, test, type = "class")

The prediction comes out as all NAs:

> head(prediction)
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: 2668 2752 2921 3005

The reason is that test$office contains NAs:

> head(test$office)
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: 2668 2752 2921 3005

I can fix the problem by removing the NAs (here, by overwriting the column with a dummy value):

> test2 <- test
> test2$office <- 1
> prediction <- predict(model, test2, type = "class")
> head(prediction)
   3    5   10   12   14   18 
 2921 2752 2921 2752 2921 2752 
Levels: 2668 2752 2921 3005

I can also avoid the problem by explicitly removing the 'office' column from the training data, rather than from the formula:

> model <- randomForest::randomForest(tc ~ ., data = train[, !(names(train) %in% c('office'))], importance = TRUE, proximity = TRUE)
> prediction <- predict(model, test, type = "class")
> head(prediction)
   3    5   10   12   14   18 
3005 2752 3005 2752 2921 2752 
Levels: 2668 2752 2921 3005
> 

My question: what is the reason for this behavior?

Isn't the formula tc ~ . - office meant to exclude 'office' from the model?

Is there an elegant solution here?

User agenis asked for the result of str(test); I masked some of the field names:

str(test)
'data.frame':   792 obs. of  15 variables:
 $ XXX              : Factor w/ 2 levels "Force","Retry": 1 2 2 1 2 2 1 1 1 1 ...
 $ XXX                  : Factor w/ 15 levels "25 Westend, Birmingham",..: 6 13 6 15 13 15 10 3 5 12 ...
 $ XXX                  : Factor w/ 3 levels "Instructions Info 1",..: 2 2 3 2 2 2 2 3 3 3 ...
 $ XXX                  : Factor w/ 3 levels "Remittance Info 1",..: 3 1 3 1 2 2 1 1 1 1 ...
 $ XXX                  : Factor w/ 3 levels "CRED","DEBT",..: 3 2 1 2 1 2 1 2 2 3 ...
 $ XXX                  : Factor w/ 3 levels "INTC","LOAN",..: 2 2 2 3 1 3 1 1 3 3 ...
 $ XXX                  : Factor w/ 15 levels "25 Westend, Birmingham",..: 3 9 15 14 5 15 10 11 2 7 ...
 $ XXX                  : Factor w/ 2 levels "SDVA","URGP": 1 2 1 1 1 2 2 2 2 1 ...
 $ XXX                  : Factor w/ 3 levels "CNY","EUR","GBP": 1 2 1 1 2 1 2 1 2 3 ...
 $ XXX                  : Factor w/ 19 levels "BNKADE22XXX",..: 3 19 11 11 4 8 8 8 19 3 ...
 $ XXX                  : Factor w/ 4 levels "_NV_E_","CNY",..: 1 3 2 2 3 2 3 2 3 1 ...
 $ XXX                  : Factor w/ 9 levels "BNKADE22XXX",..: 3 9 1 1 4 8 8 8 9 3 ...
 $ tc                   : Factor w/ 4 levels "604","688","698",..: NA NA NA NA NA NA NA NA NA NA ...
 $ office               : Factor w/ 4 levels "604","688","698",..: NA NA NA NA NA NA NA NA NA NA ...

谢伊

Recommended answer

For some reason, the randomForest function first checks the whole data for missing values before looking at what is inside your formula. It returns an error if there is an NA in any column, even one that the formula excludes:

Error in na.fail.default(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, : missing values in object

If there are no missing observations, the formula you specified is correct and the model will not use the column removed with the minus sign.
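The following base-R sketch may help show where the error comes from. It assumes, as the na.fail.default message suggests, that the formula interface builds its data through model.frame; the toy data (mtcars with NAs injected into 'am') and the variable names tt, mf and msg are illustrative only. It is not a trace of randomForest's internals.

```r
# Toy data: mtcars with NAs injected into a column that the formula excludes.
df <- mtcars
df[2:10, "am"] <- NA

# The expanded terms drop 'am' from the predictors that are actually modelled ...
tt <- terms(mpg ~ . - am, data = df)
"am" %in% attr(tt, "term.labels")

# ... but 'am' still appears in the formula, so the model frame keeps the column.
mf <- model.frame(mpg ~ . - am, data = df, na.action = na.pass)
"am" %in% names(mf)

# With na.fail, the NAs in 'am' abort the call before any model is fitted.
msg <- tryCatch(
  {
    model.frame(mpg ~ . - am, data = df, na.action = na.fail)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

In other words, the minus sign removes the variable from the model terms but not from the data frame on which the NA check runs.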

There are then two possibilities:

  1. Specify the argument na.action = na.pass to bypass the initial NA check; the algorithm will then run without error. This argument literally means "take no action": the NAs are kept and passed through. It differs from na.exclude, which would remove entire rows (which you don't want, because the other variables in those rows are non-missing).
  2. Pre-process the data manually to remove either the rows with missing values or the entire column.
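The second option could look roughly like the sketch below. The train/test split, the drop_cols vector and the *_clean names are hypothetical and exist only for illustration; the model is fitted only if the randomForest package happens to be installed.

```r
# Toy stand-in for the question's data: mtcars with NAs in one column.
df <- mtcars
df[2:10, "am"] <- NA

# Hypothetical train/test split, for illustration only.
train <- df[1:24, ]
test  <- df[25:32, ]

# Drop the unwanted column by name from both data frames.
drop_cols   <- "am"
train_clean <- train[, setdiff(names(train), drop_cols)]
test_clean  <- test[, setdiff(names(test), drop_cols)]

# No NAs remain, so the default NA check has nothing to object to.
anyNA(train_clean)

# Fit and predict only if the randomForest package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit  <- randomForest::randomForest(mpg ~ ., data = train_clean)
  pred <- predict(fit, test_clean)
  print(head(pred))
}
```

Dropping the column from both data frames keeps the formula as a plain `mpg ~ .`, so neither fitting nor prediction ever sees the missing values.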

Code example:

df <- mtcars
df[2:10, 'am'] <- NA  # inject missing values into the column to be excluded
fit <- randomForest::randomForest(mpg ~ . - am, df, na.action = na.pass)
fit$importance  # check that the 'am' variable is absent:
####      IncNodePurity
#### cyl      169.05853
#### disp     267.94975
#### hp       167.03634
#### drat      66.45550
#### wt       276.21383
#### qsec      25.33688
#### vs        30.48513
#### gear      15.39151
#### carb      24.60022
