Train test split in `r`'s `caret` package


Problem description

I'm getting familiar with r's caret package, but, coming from other programming languages, it thoroughly confused me.

What I want to do now is a fairly simple machine learning workflow, which is:

  1. Take a training set, in my case the iris dataset
  2. Split it into a training set and a test set (an 80-20 split)
  3. For every k from 1 to 20, train the k nearest neighbor classifier on the training set
  4. Test it on the test set

I understand how to do the first part, since iris is already loaded. Then, the second part is done by calling

a <- createDataPartition(iris$Species, p = 0.8, list = FALSE)  # p = 0.8 for the 80-20 split (the default is 0.5)
training <- iris[a,]
test <- iris[-a,]

Now, I also know that I can train the model by calling

library(caret)
knnFit <- train(Species~., data=training, method="knn")

However, this will result in r already performing some optimisation on the parameter k. Of course, I can limit which values of k the method should try, with something like

knnFit <- train(Species~., data=training, method="knn", tuneGrid=data.frame(k=1:20))

which works just fine, but it still doesn't do exactly what I want it to do. This code will now, for each k:

  1. Take a bootstrap sample from the test set.
  2. Assess the performance of the k-nn method using the given sample.

What I want it to do:

  1. For every k, train the model on the same training set I constructed before.
  2. Assess the performance on the same test set I constructed before.

So I need something like

knnFit <- train(Species~., training_data=training, test_data=test, method="knn", tuneGrid=data.frame(k=1:20))

but of course this doesn't work.

I understand I should do something with the trainControl parameter, but I see its possible methods are:

"boot", "boot632", "cv", "repeatedcv", "LOOCV", "LGOCV", "none"

and none of these seems to do what I want.

Answer

If I understand the question correctly, this can be done entirely within caret using LGOCV (leave-group-out CV = repeated train/test splits), setting the training percentage to p = 0.8 and the number of repeats of the train/test split to number = 1, if you really want just one model fit per k that is tested on a test set. Setting number > 1 will repeatedly assess model performance on number different train/test splits.

data(iris)
library(caret)
set.seed(123)
mod <- train(Species ~ ., data = iris, method = "knn", 
             tuneGrid = expand.grid(k=1:20),
             trControl = trainControl(method = "LGOCV", p = 0.8, number = 1,
                                      savePredictions = T))

All predictions that have been made by the different models on the test set are in mod$pred if savePredictions = T. Note rowIndex: these are the rows that have been sampled into the test set. They are equal for all values of k, so the same training/test split is used every time.

> head(mod$pred)
    pred    obs rowIndex k  Resample
1 setosa setosa        5 1 Resample1
2 setosa setosa        6 1 Resample1
3 setosa setosa       10 1 Resample1
4 setosa setosa       12 1 Resample1
5 setosa setosa       16 1 Resample1
6 setosa setosa       17 1 Resample1
> tail(mod$pred)
         pred       obs rowIndex  k  Resample
595 virginica virginica      130 20 Resample1
596 virginica virginica      131 20 Resample1
597 virginica virginica      135 20 Resample1
598 virginica virginica      137 20 Resample1
599 virginica virginica      145 20 Resample1
600 virginica virginica      148 20 Resample1 
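From mod$pred, the test-set accuracy per k is just the mean of pred == obs within each k. A minimal base-R sketch of that aggregation, run here on a small made-up data frame with the same columns as mod$pred (with the real object you would skip the construction step):

```r
# Toy stand-in for mod$pred (made-up values; same columns caret produces)
pred_df <- data.frame(
  pred = c("setosa", "setosa", "virginica", "virginica"),
  obs  = c("setosa", "virginica", "virginica", "virginica"),
  k    = c(1, 1, 2, 2)
)

# Mean of (pred == obs) within each k = per-k accuracy on the test set
acc <- aggregate(correct ~ k,
                 data = transform(pred_df, correct = pred == obs),
                 FUN = mean)
```

Here acc has one row per k; with the toy values above, k = 1 scores 0.5 and k = 2 scores 1.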

There's no need to construct train/test sets manually outside of caret unless some kind of nested validation procedure is desired. You can also plot the validation curve for the different values of k with plot(mod).
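If you do want to keep a hand-built split, the loop from the question can also be written directly with knn() from the class package (which ships with R). This is a sketch of the manual alternative, not caret's API; the sample()-based split here stands in for the createDataPartition() call above:

```r
library(class)  # ships with R; provides knn()

set.seed(123)
idx <- sample(seq_len(nrow(iris)), size = round(0.8 * nrow(iris)))  # 80-20 split
training <- iris[idx, ]
test     <- iris[-idx, ]

# For every k from 1 to 20: fit on the fixed training set, score on the fixed test set
accuracy <- sapply(1:20, function(k) {
  pred <- knn(train = training[, 1:4], test = test[, 1:4],
              cl = training$Species, k = k)
  mean(pred == test$Species)
})
```

Unlike train(), this does no resampling at all: each k is fit once on training and evaluated once on test, which is exactly the workflow described in the question.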
