Creating data partitions over a selected range of data to be fed into caret::train function for cross-validation

Views: 167 · Posted: 2020/6/11 2:09:56 · Tags: r, cross-validation, r-caret, data-partitioning


Question

I want to create jackknife data partitions for the data frame below, to be used in caret::train (like the ones caret::groupKFold() produces). The catch is that I want to restrict the test points to, say, 16 days or more, whilst using the remainder of the data as the training set.

df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = 1:20)

The reason I want to do this is that I am only really interested in how well the model predicts the upper bound, as this is the region of interest. I feel there is a way to do this with the caret::groupKFold() function, but I am not sure how. Any help would be greatly appreciated.

An example of what each CV fold would comprise:

TrainSet1 <- subset(df, Time != 16)
TestSet1 <- subset(df, Time == 16)

TrainSet2 <- subset(df, Time != 17)
TestSet2 <- subset(df, Time == 17)

TrainSet3 <- subset(df, Time != 18)
TestSet3 <- subset(df, Time == 18)

TrainSet4 <- subset(df, Time != 19)
TestSet4 <- subset(df, Time == 19)

TrainSet5 <- subset(df, Time != 20)
TestSet5 <- subset(df, Time == 20)

However, I need them in the format that the caret::groupKFold function outputs, so that the folds can be fed into the caret::train function:

CVFolds <- caret::groupKFold(df$Time)
CVFolds

Thanks in advance!

Recommended answer

For customized folds, I find the built-in functions are usually not flexible enough, so I usually produce them using tidyverse. One approach to your problem would be:

library(tidyverse)

df %>%
  mutate(id = row_number()) %>% #use the row number as a column called id
  filter(Time > 15) %>% #filter Time as per your need
  split(.$Time)  %>% #split df to a list by Time
  map(~ .x %>% select(id)) #select row numbers for each list element

An example with two rows per time point:

df <- data.frame(Effect = seq(from = 0.025, to = 1, by = 0.025),
                 Time = rep(1:20, each = 2))

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 15) %>%
  split(.$Time)  %>%
  map(~ .x %>% select(id)) -> test_folds

test_folds
#output
$`16`
  id
1 31
2 32

$`17`
  id
3 33
4 34

$`18`
  id
5 35
6 36

$`19`
  id
7 37
8 38

$`20`
   id
9  39
10 40

With an unequal number of rows per time point:

df <- data.frame(Effect = seq(from = 0.55, to = 1, by = 0.05),
                 Time = c(rep(1, 5), rep(2, 3), rep(3, 2)))

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time)  %>%
  map(~ .x %>% select(id))

$`2`
  id
1  6
2  7
3  8

$`3`
  id
4  9
5 10

Now you can define these hold-out folds inside trainControl with the argument indexOut. Note that indexOut must be a list of the same length as index, which holds the training rows for each resample.

Edit: to get output in a format similar to that of caret::groupKFold (a list of integer vectors), one can:

df %>%
  mutate(id = row_number()) %>%
  filter(Time > 1) %>%
  split(.$Time)  %>%
  map(~ .x %>%
        select(id) %>%
        unlist %>%
        unname) %>%
  unname
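
To make the trainControl step concrete, here is a minimal sketch of how hold-out folds like these might be passed to caret::train. It is not part of the original answer: the names test_rows, index_out, index_in and ctrl are illustrative, the folds are built with base R instead of tidyverse, and the linear model is only a placeholder. It assumes the one-row-per-day data frame from the question.

```r
library(caret)

df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = 1:20)

# Hold-out rows: one fold per Time value above 15 (assumed threshold)
test_rows <- which(df$Time > 15)
index_out <- split(test_rows, df$Time[test_rows])
names(index_out) <- paste0("Fold", seq_along(index_out))

# Training rows for each fold: every row that is not held out in that fold
index_in <- lapply(index_out, function(held_out) {
  setdiff(seq_len(nrow(df)), held_out)
})

# index gives the training rows, indexOut the matching hold-out rows
ctrl <- trainControl(method = "cv", index = index_in, indexOut = index_out)

# Placeholder model; any caret method could be used here
fit <- train(Effect ~ Time, data = df, method = "lm", trControl = ctrl)
fit$resample
```

With a single row per hold-out fold, caret can report RMSE and MAE for each resample, but R-squared is undefined for one observation and will show as NA with a warning.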
