R caret package rfe never finishes: error task 1 failed - "replacement has length zero"

Problem description

I recently started looking into the caret package for a model I'm developing, and I'm using the latest version. As a first step, I decided to use it for feature selection. The data has about 760 features and 10k observations. I wrote a simple function based on the online training material. Unfortunately, I consistently get an error, so the process never finishes. Here is the code that produces the error. In this example I am using a small subset of the features, although I started with the full set. I've also changed the subsets, the number of folds, and the number of repeats, to no avail. I know it will be hard to track down the issue without the data, so I have shared a small subset of it (in R object format, as used below). If you have trouble getting the file from there try this link.

It always produces this error:

Error in { : task 1 failed - "replacement has length zero"

caretFeatureSelection <- function() {
  library(caret)
  library(mlbench)
  library(Hmisc)

  set.seed(10)

  lr.features = c("f2", "f271","f527","f528","f404", "f376", "f67",  "f670", "f281", "f333", "f13",  "f282", "f599",
                  "f597", "f68",  "f629", "f378", "f230", "f229", "f273", "f768", "f406", "f630", 
                  "f596", "f598", "f413", "f412", "f332", "f377", "f766", "f767", "f775", "f10", "f442")

  trainDF <- readRDS(file='trainDF.rds')
  trainDF <- trainDF[trainDF$loss>0,]
  trainDF$lossProb <- trainDF$loss/100
  y <- trainDF[,'lossProb']
  x <- trainDF[,names(trainDF) %in% lr.features]

  rm(trainDF)

  subsets <- c(1:5, 10, 15, 20, 25)
  ctrl <- rfeControl(functions = lrFuncs,
                   method = "repeatedcv",
                   repeats = 1,
                   number=5)

  lrProfile <- rfe(x, y,
                 sizes = subsets,
                 rfeControl = ctrl)

  lrProfile
}

Recommended answer

Looking at the data, there are three reasons for the failure. First, f2 is a factor:

> str(x)
'data.frame':   100 obs. of  34 variables:
 $ f2  : Factor w/ 10 levels "1","2","3","4",..: 8 8 8 8 9 8 9 9 7 8 ...
<snip>

rfe fits an lm model to these data and generates 39 coefficients even though the data frame x has only 34 columns, because the factor is expanded into several dummy variables. As a result, rfe gets... confused. Try using model.matrix to convert the factor to dummy variables before running rfe:

x2 <- model.matrix(~., data = x)[,-1]  ## the -1 removes the intercept column

...but...

> table(x$f2)

 1  2  3  4  6  7  8  9 10 11 
 0  0  0  2  2  5 32 36 23  0 

So model.matrix will generate some zero-variance predictors for the empty levels (which is an issue). You could make a new factor that excludes the empty levels, but keep in mind that any resampling of these data can turn the columns for the sparse levels (e.g. "4", "6") into zero-variance predictors.
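A minimal sketch of the empty-level problem, using only base R. The data frame `toy` and its columns `a` and `f` are made up for illustration and are not taken from the question's data; the factor declares a level ("11") that never occurs, as f2 does above.

```r
# Toy data (hypothetical): factor f declares level "11" but never uses it
toy <- data.frame(
  a = c(0.1, 0.4, 0.2, 0.9),
  f = factor(c("4", "6", "7", "7"), levels = c("4", "6", "7", "11"))
)

# Expand the factor into dummy variables and drop the intercept column
mm <- model.matrix(~ ., data = toy)[, -1]

# The dummy column for the unused level is all zeros, so its variance is zero
zero_var <- apply(mm, 2, var) == 0
names(zero_var)[zero_var]

# droplevels() removes unused levels, so no all-zero column is created
toy2 <- droplevels(toy)
mm2 <- model.matrix(~ ., data = toy2)[, -1]
colnames(mm2)
```

Dropping unused levels only helps with levels that are empty in the full data set. A level with just a couple of observations can still end up constant inside a single resample.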

Secondly, there is perfect correlation between some predictors:

> cor(x$f597, x$f599)
     [,1]
[1,]    1

This will cause NA values for some of the model coefficients, which leads to missing variable importances and causes rfe to fail.
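The effect is easy to reproduce with base R alone. The sketch below uses simulated data; the column names `p`, `q` and `y` are placeholders and are not columns from the question's data.

```r
set.seed(1)

# Simulated data (hypothetical): q is an exact linear function of p
toy <- data.frame(p = rnorm(20))
toy$q <- 2 * toy$p + 3
toy$y <- toy$p + rnorm(20)

cor(toy$p, toy$q)   # perfectly correlated predictors

# lm cannot estimate both effects, so the aliased coefficient is NA
fit <- lm(y ~ p + q, data = toy)
coef(fit)
is.na(coef(fit))
```

Any code that ranks predictors from such coefficients has to cope with the missing value, which is why removing the redundant predictor beforehand is the safer route.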

Unless you are using trees or some other model that tolerates sparse and/or correlated predictors, a possible workflow prior to rfe could be:

> x2 <- model.matrix(~., data = x)[,-1]
> 
> nzv <- nearZeroVar(x2)
> x3 <- x2[, -nzv]
> 
> corr_mat <- cor(x3)
> too_high <- findCorrelation(corr_mat, cutoff = .9)
> x4 <- x3[, -too_high]
> 
> c(ncol(x2), ncol(x3), ncol(x4))
[1] 42 37 27
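One caveat when reusing this pattern on other data: if a filter flags nothing, the index vector is empty, and negative indexing with an empty vector selects zero columns rather than all of them. A small guard avoids that. The helper name `drop_cols` below is hypothetical and is not part of caret.

```r
# Hypothetical helper: drop the columns in idx, or return m unchanged
# when idx is empty (m[, -integer(0)] would select no columns at all)
drop_cols <- function(m, idx) {
  if (length(idx) > 0) m[, -idx, drop = FALSE] else m
}

m <- matrix(1:6, nrow = 2)

ncol(m[, -integer(0), drop = FALSE])   # the pitfall: no columns left
ncol(drop_cols(m, integer(0)))         # guarded: all columns kept
ncol(drop_cols(m, 2))                  # normal case: one column removed
```

In the workflow above this would be used as, for example, `x3 <- drop_cols(x2, nzv)` in place of `x2[, -nzv]`.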

Lastly, by the looks of y you want to predict a number, but lrFuncs is for logistic regression, so I assume it was a typo for lmFuncs. If that is the case, rfe works fine:

> subsets <- c(1:5, 10, 15, 20, 25)
> ctrl <- rfeControl(functions = lmFuncs,
+                    method = "repeatedcv",
+                    repeats = 1,
+                    number=5)
> set.seed(1)
> lrProfile <- rfe(as.data.frame(x4), y,
+                  sizes = subsets,
+                  rfeControl = ctrl)

Max
