Train test split in `r`'s `caret` package
Question
I'm getting familiar with `r`'s `caret` package, but, coming from other programming languages, it thoroughly confused me.
What I want to do now is a fairly simple machine learning workflow, which is:
- Take a training set, in my case the iris dataset
- Split it into a training and test set (an 80-20 split)
- For every `k` from `1` to `20`, train the `k`-nearest-neighbor classifier on the training set
- Test it on the test set
I understand how to do the first part, since `iris` is already loaded. Then, the second part is done by calling
library(caret)
# p = 0.8 gives the intended 80-20 split (the default is p = 0.5)
a <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[a, ]
test <- iris[-a, ]
Now, I also know that I can train the model by calling
library(caret)
knnFit <- train(Species ~ ., data = training, method = "knn")
However, this will result in `r` already performing some optimisation on the parameter `k`. Of course, I can limit what values of `k` the method should try, with something like
knnFit <- train(Species~., data=training, method="knn", tuneGrid=data.frame(k=1:20))
which works just fine, but it still doesn't do exactly what I want it to do. This code will now do, for each `k`:
- take a bootstrap sample from the `test`.
- Assess the performance of the `k`-nn method using the given sample
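What I want it to do:
- For each `k`, train the model on the same training set that I constructed earlier
- Assess the performance on the same test set that I constructed earlier.
So I would need something like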
knnFit <- train(Species~., training_data=training, test_data=test, method="knn", tuneGrid=data.frame(k=1:20))
but this of course does not work.
I understand I should do something with the `trainControl` parameter, but I see its possible methods are:
"boot", "boot632", "cv", "repeatedcv", "LOOCV", "LGOCV", "none"
and none of these seems to do what I want.
Recommended answer
If I understand the question correctly, this can be done entirely within caret using LGOCV (leave-group-out CV, i.e. repeated train/test splits), setting the training percentage to `p = 0.8` and the number of train/test splits to `number = 1` if you really want just one model fit per `k` that is tested on a test set. Setting `number` > 1 will repeatedly assess model performance on `number` different train/test splits.
data(iris)
library(caret)
set.seed(123)
mod <- train(Species ~ ., data = iris, method = "knn",
tuneGrid = expand.grid(k=1:20),
trControl = trainControl(method = "LGOCV", p = 0.8, number = 1,
savePredictions = T))
All predictions that have been made by the different models on the test set are in `mod$pred` if `savePredictions = T`. Note `rowIndex`: these are the rows that have been sampled into the test set. They are the same for all values of `k`, so the same training/test sets are used every time.
> head(mod$pred)
    pred    obs rowIndex k  Resample
1 setosa setosa        5 1 Resample1
2 setosa setosa        6 1 Resample1
3 setosa setosa       10 1 Resample1
4 setosa setosa       12 1 Resample1
5 setosa setosa       16 1 Resample1
6 setosa setosa       17 1 Resample1
> tail(mod$pred)
         pred       obs rowIndex  k  Resample
595 virginica virginica      130 20 Resample1
596 virginica virginica      131 20 Resample1
597 virginica virginica      135 20 Resample1
598 virginica virginica      137 20 Resample1
599 virginica virginica      145 20 Resample1
600 virginica virginica      148 20 Resample1
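As a rough illustration of how the saved predictions can be used, the sketch below refits the same model and then summarises `mod$pred` by `k`. The names `correct`, `acc_by_k`, `rows_by_k` and `same_rows` are made up for this example; it is one possible way to inspect the output, not something the answer prescribes.

```r
library(caret)

data(iris)
set.seed(123)
mod <- train(Species ~ ., data = iris, method = "knn",
             tuneGrid = expand.grid(k = 1:20),
             trControl = trainControl(method = "LGOCV", p = 0.8, number = 1,
                                      savePredictions = TRUE))

# Hold-out accuracy for each k, computed directly from the saved predictions
# (`correct` and `acc_by_k` are illustrative names)
mod$pred$correct <- mod$pred$pred == mod$pred$obs
acc_by_k <- aggregate(correct ~ k, data = mod$pred, FUN = mean)

# Check that every k was evaluated on the same hold-out rows
rows_by_k <- split(mod$pred$rowIndex, mod$pred$k)
same_rows <- all(sapply(rows_by_k, function(r) setequal(r, rows_by_k[[1]])))

print(acc_by_k)
print(same_rows)
```

With a single split, these per-`k` accuracies should correspond to the `Accuracy` column that caret reports in `mod$results`.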
There's no need to construct train/test sets manually outside of caret unless some kind of nested validation procedure is desired. You can also plot the validation curve for the different values of `k` with `plot(mod)`.
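If a manually built split is wanted anyway, one possible approach is to hand the training row indices to `trainControl` through its `index` argument. This is a minimal sketch, not part of the original answer: it assumes that, with `indexOut` left unset, caret evaluates on the rows not listed in `index`, and the names `train_rows`, `ctrl`, `mod_fixed` and `held_out` are made up for illustration.

```r
library(caret)

data(iris)
set.seed(123)

# Build the 80-20 split once; list = TRUE returns a named list of training rows
train_rows <- createDataPartition(iris$Species, p = 0.8, list = TRUE)

# Assumption: passing `index` makes train() fit on exactly these rows and,
# with `indexOut` left unset, evaluate on the remaining rows
ctrl <- trainControl(method = "LGOCV", index = train_rows,
                     savePredictions = TRUE)

mod_fixed <- train(Species ~ ., data = iris, method = "knn",
                   tuneGrid = expand.grid(k = 1:20),
                   trControl = ctrl)

# Rows that were never used for fitting, i.e. the fixed test set
held_out <- setdiff(seq_len(nrow(iris)), train_rows[[1]])

print(mod_fixed$results[, c("k", "Accuracy")])
```

Because the split is created before `train()` is called, the same `train_rows` can be reused to compare other model types on an identical hold-out set.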