Controlling sampling for cross-validation in the caret R package

Question

I have the following problem. In a data set from N subjects, I have several samples per subject. I want to train a model on the data set, but I would like to make sure that, in each resample, the training set contains no other samples from the subject being tested.

Alternatively, I would like to block the cross-validation by subject. Is that possible?

Without the caret package, I would do something like this (mock code):

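subjects <- paste0("X", 1:10)
samples  <- rep(subjects, each = 5)   # subject label for each of the 50 rows
x <- matrix(runif(50 * 10), nrow = 50)

## leave one sample out, and drop every other sample of its subject from training
loocv <- function(x, samples) {
  for (i in seq_len(nrow(x))) {
    test  <- x[i, , drop = FALSE]
    train <- x[samples != samples[i], , drop = FALSE]
    # create the model from train and predict for test
  }
}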

or, alternatively,

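## leave one subject out: all of a subject's samples are held out together
looSubjCV <- function(x, samples, subjects) {
  for (i in seq_along(subjects)) {
    test  <- x[samples == subjects[i], , drop = FALSE]
    train <- x[samples != subjects[i], , drop = FALSE]
    # create the model from train and predict for test
  }
}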

Otherwise, other samples from the same subject would end up in the training set, the model would overfit to those subjects, and the cross-validated performance would look better than it really is.
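
The leave-one-subject-out idea from the mock code can be made concrete with a minimal base-R sketch. Everything specific here is an assumption for illustration and not part of the question: the simulated data frame, the column names `subject`, `x` and `y`, the function name `loso_cv`, and the choice of `lm` as the model.

```r
# Simulated data: 10 subjects, 5 samples each (illustrative only)
set.seed(1)
subjects <- paste0("X", 1:10)
dat <- data.frame(
  subject = rep(subjects, each = 5),
  x = runif(50),
  stringsAsFactors = FALSE
)

# A per-subject offset makes samples from the same subject correlated
offset <- rnorm(length(subjects))
names(offset) <- subjects
dat$y <- 2 * dat$x + unname(offset[dat$subject]) + rnorm(50, sd = 0.1)

# Leave-one-subject-out CV: all samples of one subject are held out together
loso_cv <- function(dat) {
  pred <- numeric(nrow(dat))
  for (s in unique(dat$subject)) {
    held_out <- dat$subject == s
    fit <- lm(y ~ x, data = dat[!held_out, ])
    pred[held_out] <- predict(fit, newdata = dat[held_out, ])
  }
  pred
}

pred <- loso_cv(dat)
rmse <- sqrt(mean((dat$y - pred)^2))
rmse
```

Because each held-out block contains every row of one subject, no prediction is made by a model that has seen that subject.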

Recommended answer

Not directly, but you can definitely do it using the index and indexOut arguments to trainControl. Here is an example using 10-fold CV:

library(caret)
library(nlme)

data(Orthodont)
head(Orthodont)
subjects <- as.character(unique(Orthodont$Subject))

## figure out folds at the subject level

set.seed(134)
sub_folds <- createFolds(y = subjects, list = TRUE, returnTrain = TRUE)

## now create the mappings to which *rows* are in the training set
## based on which subjects are left in or out

in_train <- holdout <- vector(mode = "list", length = length(sub_folds))

row_index <- 1:nrow(Orthodont)

for(i in seq(along = sub_folds)) {
  ## Which subjects are in fold i
  sub_in <- subjects[sub_folds[[i]]]
  ## which rows of the data correspond to those subjects
  in_train[[i]] <- row_index[Orthodont$Subject %in% sub_in]
  holdout[[i]]  <- row_index[!(Orthodont$Subject %in% sub_in)]  
}

names(in_train) <- names(holdout) <- names(sub_folds)

ctrl <- trainControl(method = "cv",
                     savePredictions = TRUE,
                     index = in_train,
                     indexOut = holdout)

mod <- train(distance ~ (age+Sex)^2, data = Orthodont,
             method = "lm", 
             trControl = ctrl)

first_fold <- subset(mod$pred, Resample == "Fold01")

## These were used to fit the model
table(Orthodont$Subject[-first_fold$rowIndex])
## These were heldout:
table(Orthodont$Subject[first_fold$rowIndex])
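
The mapping from subject-level folds to row indices can also be wrapped in a small helper and checked for leakage. The sketch below uses base R only. The function name `make_subject_folds`, the toy `group` vector and the choice of `k = 4` are assumptions for illustration and do not come from the answer above.

```r
# Hypothetical helper: assign whole subjects to folds, return row indices
make_subject_folds <- function(group, k = 5) {
  group <- as.character(group)
  ids <- unique(group)
  stopifnot(k <= length(ids))
  # Shuffle fold labels so subjects are dealt into k folds of near-equal size
  fold_of_id <- sample(rep(seq_len(k), length.out = length(ids)))
  names(fold_of_id) <- ids
  row_fold <- unname(fold_of_id[group])
  rows <- seq_along(group)
  index     <- lapply(seq_len(k), function(f) rows[row_fold != f])
  index_out <- lapply(seq_len(k), function(f) rows[row_fold == f])
  names(index) <- names(index_out) <- sprintf("Fold%02d", seq_len(k))
  list(index = index, indexOut = index_out)
}

set.seed(134)
group <- rep(paste0("S", 1:12), each = 4)  # toy data: 12 subjects, 4 rows each
folds <- make_subject_folds(group, k = 4)

# Sanity check: count subjects that appear on both sides of each split
leaks <- mapply(function(tr, te) length(intersect(group[tr], group[te])),
                folds$index, folds$indexOut)
leaks  # every entry is 0, since each subject belongs to exactly one fold
```

The two lists have the same shape as `in_train` and `holdout` above, so they could be passed as `trainControl(index = folds$index, indexOut = folds$indexOut)`. caret also exports a `groupKFold()` function that returns training-row indices grouped in a similar way; check its documentation for the version you have installed.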
