R caret package: error when specifying index for both rfeControl and trainControl

Problem description

I get an error when I specify index for both rfeControl and trainControl.

To build the rfe helper functions for glmnet, I wrote:

glmnetFuncs <- caretFuncs #Default caret functions

glmnetFuncs$summary <-  twoClassSummary

To specify the index for rfeControl:

MyRFEcontrol <- rfeControl(
  method="LGOCV",
  number=5,
  index=RFE_CV_IN,
  functions = glmnetFuncs,
  verbose = TRUE)

To specify the index for trainControl:

MyTrainControl=trainControl(
  method="LGOCV",
  index=indexIN,
  classProbs = TRUE,
  summaryFunction=twoClassSummary
)

Since the data set is big, I chose just 3 columns to check that it works:

x=train_v_final4[,c(1,30,55)]
y=TARGET


RFE <- rfe(x=x,y=y,sizes = seq(2,3,by=1),
           metric = 'ROC',maximize=TRUE,rfeControl = MyRFEcontrol,
           method='glmnet',
          # tuneGrid = expand.grid(.alpha=c(0,0.1,1),.lambda=c(0.1,0.01,0.05)),
           trControl = MyTrainControl)

But I get an error saying:

**model fit failed for a: alpha=0.10, lambda=3 Error in if (!all(o)) { : missing value where TRUE/FALSE needed**

I tried all the possible combinations:

  1. specifying index in both rfeControl and trainControl,
  2. specifying index in rfeControl but not in trainControl,
  3. specifying index in trainControl but not in rfeControl.

However, none of them works. But it works fine if I use these index lists in the train() function directly. Does anyone know what I need to fix? Any comments/thoughts are much appreciated!

Details

> nearZeroVar(x[indexIN[[1]],])
integer(0) # other results (nearZeroVar(x[indexIN[[2]],]), etc.) are omitted since the outputs are identical

> cor(x[indexIN[[1]],])
                         id category_q total_spent_90
id             1.0000000000  0.0300781   0.0001837173
category_q     0.0300781045  1.0000000   0.4102276754
total_spent_90 0.0001837173  0.4102277   1.0000000000

> nearZeroVar(x[RFE_CV_IN[[1]],])
integer(0)

> cor(x[RFE_CV_IN[[1]],])
                          id  category_q total_spent_90
id              1.0000000000 0.002903591  -0.0004827006
category_q      0.0029035912 1.000000000   0.9612495056
total_spent_90 -0.0004827006 0.961249506   1.0000000000


> str(RFE_CV_IN)
List of 20
 $ Resample01: int [1:28670] 8 12 35 39 47 51 55 66 71 76 ...
 $ Resample02: int [1:28670] 1 5 7 38 39 49 55 76 91 100 ...
 $ Resample03: int [1:28670] 1 5 7 8 18 30 38 39 49 63 ...
 $ Resample04: int [1:28670] 9 12 18 24 30 35 38 39 49 51 ...
 $ Resample05: int [1:28670] 8 30 47 49 51 63 71 76 77 92 ...
 $ Resample06: int [1:28670] 1 18 30 39 49 55 63 66 71 77 ...
 $ Resample07: int [1:28670] 5 18 24 25 51 76 91 101 112 116 ...
 $ Resample08: int [1:28670] 1 5 7 12 24 25 38 39 49 51 ...
 $ Resample09: int [1:28670] 8 18 24 25 38 49 51 76 101 113 ...
 ....omit rest...

> str(indexIN)
List of 20
 $ Resample01: int [1:64024] 1 6 11 12 14 15 17 19 20 22 ...
 $ Resample02: int [1:64024] 8 11 13 14 18 19 21 22 24 25 ...
 $ Resample03: int [1:64024] 1 3 4 6 11 13 14 15 16 21 ...
 $ Resample04: int [1:64024] 3 9 11 12 13 14 22 24 26 28 ...
.....omit rest

Recommended answer

The problem might be that the outer function (rfe) uses row indices that refer to the original data but, once train sees the data, those row numbers no longer mean the same thing, because train only receives the subset of rows that rfe passes to it.

Suppose you have 100 data points and are doing 10-fold CV and the first fold is 1-10, the second is 11-20 etc.

On the first fold, rfe passes rows 11-100 to train, which sees them as a 90-row data set. If the index vector in train has any indices > 90, there will be an error. If not, it may run, but not with the rows that you originally told train to use.
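To make the row-number shift concrete, here is a minimal base R sketch of that 100-row example. It does not call caret; the data frame and the values in inner_index are made up for illustration, and whether this is exactly what triggers the glmnet error above is an assumption.

```r
# Toy data: 100 rows, where row_id records each row's ORIGINAL position.
n <- 100
dat <- data.frame(row_id = seq_len(n), x = seq_len(n) * 2)

# Outer fold 1 holds out rows 1-10, so rows 11-100 are passed inward.
outer_train_rows <- 11:100
inner_dat <- dat[outer_train_rows, ]
rownames(inner_dat) <- NULL  # the inner data is now indexed 1..90

# Hypothetical index that was built against the ORIGINAL 100 rows.
inner_index <- c(5, 50, 95)

# Applying it to the 90-row inner data selects different rows than intended,
# and position 95 does not exist, so that row comes back as all NA.
picked <- inner_dat[inner_index, ]
print(picked$row_id)
```

Positions 5 and 50 silently pick original rows 15 and 60, and position 95 yields a row of missing values, which is one plausible route to a "missing value where TRUE/FALSE needed" failure inside the model fit.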

You could do this, but it would require a separate set of resampling indices for each resample of the outer model (i.e. rfe), since the inner data will be different each time. Also, you would need to be really careful if you use bootstrapping, since it samples with replacement; otherwise your model-building data and the holdout data could contain the exact same records.
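The bookkeeping this implies can be sketched in base R as follows. The helper make_inner, the split sizes, and the resample names are all hypothetical, and this only shows how the indices would be constructed: rfe takes a single trControl, so actually using a different inner index per outer resample would need a custom loop or fit function, which is not shown here.

```r
set.seed(1)
n <- 100

# Hypothetical outer resamples: 3 splits, each keeping 80 of the 100 rows.
outer_index <- lapply(1:3, function(i) sort(sample.int(n, 80)))
names(outer_index) <- sprintf("Outer%02d", 1:3)

# For one outer resample, draw inner indices as POSITIONS within the rows
# that the inner model will actually see (1..length(outer_rows)).
# Sampling without replacement avoids the duplicate-record problem.
make_inner <- function(outer_rows, times = 5, p = 0.75) {
  m <- length(outer_rows)
  inner <- lapply(seq_len(times), function(j) sort(sample.int(m, floor(p * m))))
  names(inner) <- sprintf("Inner%02d", seq_len(times))
  inner
}

# One separate set of inner indices per outer resample.
inner_index <- lapply(outer_index, make_inner)

# For traceability, map inner positions back to original row numbers.
orig_rows <- outer_index$Outer01[inner_index$Outer01$Inner01]
```

Every inner index stays within 1..80, so it is always valid for the data the inner model receives, and orig_rows recovers which original records were used.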

If what you really want is reproducibility/traceability, set the seeds in rfeControl and trainControl (both have a seeds argument). I'm pretty sure that you will get the same resamples across different runs (as long as the data set and resampling methods stay the same across runs).

Max
