preprocess within cross-validation in caret
Question
I have a question about data preprocessing that needs to be clarified. To my understanding, when we tune hyperparameters and estimate model performance via cross-validation, rather than preprocessing the whole dataset up front, we need to do the preprocessing within cross-validation. In other words, within each cross-validation iteration we preprocess the training folds, then apply those same preprocessing parameters to the test fold before making predictions.
In the example code below, when I specify preProcess within caret::train, does it do that automatically? I would really appreciate it if someone could clarify this for me.
In some online sources, people preprocess the whole dataset (the training set) and then use the preprocessed data to tune hyperparameters via cross-validation. That does not seem right to me.
library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

control <- trainControl(method = "cv",
                        number = 5,
                        preProcOptions = list(pcaComp = 4))
grid <- expand.grid(mtry = c(1, 2, 3))
model <- train(diabetes ~ ., data = PimaIndiansDiabetes, method = "rf",
               preProcess = c("scale", "center", "pca"),
               trControl = control,
               tuneGrid = grid)
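For contrast, the approach from those online sources that seems wrong can be sketched as follows (an illustrative sketch only; column 9 of PimaIndiansDiabetes is the outcome, diabetes):

library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

# Leaky approach: estimate centering/scaling/PCA on ALL rows before any
# resampling, so information from every future CV holdout leaks into the
# preprocessing parameters.
pp_all <- preProcess(PimaIndiansDiabetes[, -9],
                     method = c("center", "scale", "pca"),
                     pcaComp = 4)
leaky <- predict(pp_all, PimaIndiansDiabetes[, -9])
leaky$diabetes <- PimaIndiansDiabetes$diabetes
# Tuning on `leaky` via cross-validation would reuse holdout information,
# giving optimistically biased performance estimates.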
Answer
Your worries are well placed; there are many ways to introduce optimistic bias.
According to Max Kuhn, the creator of caret, there is no data leakage when preProcess is specified in train:
All pre-processing is applied on the resampled version of the data (e.g. 90% in 10-fold CV) and then those calculations are applied to the holdouts (the remaining 10%) with no re-calculation.
Source: https://github.com/topepo/caret/issues/335
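What train does internally can be sketched by hand with caret::preProcess for a single fold (a minimal illustrative sketch, not caret's actual internals; the fold setup and variable names are my own):

library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

# One CV split: hold out fold 1, train on the rest
folds <- createFolds(PimaIndiansDiabetes$diabetes, k = 5)
holdout_idx <- folds[[1]]
train_fold <- PimaIndiansDiabetes[-holdout_idx, ]
test_fold  <- PimaIndiansDiabetes[holdout_idx, ]

# Estimate centering/scaling/PCA from the training fold ONLY
pp <- preProcess(train_fold[, -9],
                 method = c("center", "scale", "pca"),
                 pcaComp = 4)

# Apply the *same* parameters to both folds -- no re-estimation on the holdout
train_proc <- predict(pp, train_fold[, -9])
test_proc  <- predict(pp, test_fold[, -9])

A model fit on train_proc and evaluated on test_proc never sees preprocessing parameters influenced by the holdout, which is exactly the no-leakage behavior Kuhn describes.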