Creating data partitions over a selected range of data to be fed into caret::train function for cross-validation

Question

I want to create jackknife (leave-one-out) data partitions for the data frame below, to be used in caret::train (in the form that caret::groupKFold() produces). The catch is that I want to restrict the test points to the later times, say day 16 onward, while using the remainder of the data as the training set.

df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = 1:20)

The reason I want to do this is that I am only really interested in how well the model predicts at the upper end of the range, as this is the region of interest. I suspect there is a way to do this with caret::groupKFold(), but I am not sure how. Any help would be greatly appreciated.

An example of what each CV fold would comprise:

TrainSet1 <- subset(df, Time != 16)
TestSet1 <- subset(df, Time == 16)

TrainSet2 <- subset(df, Time != 17)
TestSet2 <- subset(df, Time == 17)

TrainSet3 <- subset(df, Time != 18)
TestSet3 <- subset(df, Time == 18)

TrainSet4 <- subset(df, Time != 19)
TestSet4 <- subset(df, Time == 19)

TrainSet5 <- subset(df, Time != 20)
TestSet5 <- subset(df, Time == 20)

However, the folds should be in the format that caret::groupKFold returns, so that they can be fed into caret::train:

CVFolds <- caret::groupKFold(df$Time)
CVFolds

Thanks in advance!

Recommended answer

For customized folds, I find that the built-in functions are usually not flexible enough, so I usually build them with the tidyverse. One approach to your problem would be:

library(tidyverse)

df %>%
  mutate(id = row_number()) %>% #use the row number as a column called id
  filter(Time > 15) %>% #filter Time as per your need
  split(.$Time)  %>% #split df to a list by Time
  map(~ .x %>% select(id)) #select row numbers for each list element
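
The same hold-out folds can also be sketched in base R, which may be handy when the tidyverse is not available. This is only an illustration on the question's toy data frame; the names rows_by_time and test_folds_base are made up for this example.

```r
# Toy data from the question: one row per time point.
df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = 1:20)

# Split the row numbers by Time: one list element per time point,
# named after the Time value.
rows_by_time <- split(seq_len(nrow(df)), df$Time)

# Keep only the groups in the region of interest (Time > 15).
test_folds_base <- rows_by_time[as.numeric(names(rows_by_time)) > 15]

test_folds_base
```

Each element holds the row numbers to hold out in one fold, so the result plays the same role as the tidyverse pipeline above, just as plain integer vectors rather than one-column data frames.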

Example with two rows per time point:

df <- data.frame(Effect = seq(from = 0.025, to = 1, by = 0.025),
                 Time = rep(1:20, each = 2))

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 15) %>%
  split(.$Time)  %>%
  map(~ .x %>% select(id)) -> test_folds

test_folds
#output
$`16`
  id
1 31
2 32

$`17`
  id
3 33
4 34

$`18`
  id
5 35
6 36

$`19`
  id
7 37
8 38

$`20`
   id
9  39
10 40

With an unequal number of rows per time point:

df <- data.frame(Effect = seq(from = 0.55, to = 1, by = 0.05),
                 Time = c(rep(1, 5), rep(2, 3), rep(3, 2)))

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time)  %>%
  map(~ .x %>% select(id))

$`2`
  id
1  6
2  7
3  8

$`3`
  id
4  9
5 10

Now you can define these hold-out folds inside trainControl with the indexOut argument. The matching training rows for each fold go in the index argument.
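
As a rough sketch of how this could be wired up: trainControl takes the training rows in index and the held-out rows in indexOut, both as lists of integer vectors of equal length. Note that caret::groupKFold returns training rows, so the hold-out folds have to be complemented to get a comparable list. The names index_out, index_in and ctrl, and the "Fold1" style labels, are choices made for this example, and the caret call is guarded so the snippet still runs where caret is not installed.

```r
# Toy data from the question: one row per time point.
df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = 1:20)

# Hold-out rows: one fold per time point in the region of interest.
test_times <- sort(unique(df$Time[df$Time > 15]))
index_out <- lapply(test_times, function(t) which(df$Time == t))
names(index_out) <- paste0("Fold", seq_along(index_out))

# Training rows: every row that is not held out in the same fold.
index_in <- lapply(index_out, function(held_out) {
  setdiff(seq_len(nrow(df)), held_out)
})

# Build the resampling specification from both lists.
if (requireNamespace("caret", quietly = TRUE)) {
  ctrl <- caret::trainControl(method = "cv",
                              index = index_in,
                              indexOut = index_out)
}
```

The resulting ctrl object would then be passed to caret::train through its trControl argument. With a single held-out row per fold, metrics such as R-squared cannot be computed within a fold, so an error-based metric such as RMSE is the more sensible choice here.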

EDIT: To get output similar to that of caret::groupKFold, one can:

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time)  %>%
  map(~ .x %>%
        select(id) %>%
        unlist %>%
        unname) %>%
  unname
