Creating data partitions over a selected range of data to be fed into the caret::train function for cross-validation
Problem description
I want to create jack-knife data partitions for the data frame below, for use in caret::train (like those that caret::groupKFold() produces). The catch is that I want to restrict the test points to, say, day 16 and later, while using the remainder of the data as the training set.
df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = 1:20)
The reason I want to do this is that I am only really interested in how well the model predicts the upper bound, as this is the region of interest. I feel like there is a way to do this with the caret::groupKFold() function, but I am not sure how. Any help would be greatly appreciated.
An example of what each CV fold would comprise:
TrainSet1 <- subset(df, Time != 16)
TestSet1 <- subset(df, Time == 16)
TrainSet2 <- subset(df, Time != 17)
TestSet2 <- subset(df, Time == 17)
TrainSet3 <- subset(df, Time != 18)
TestSet3 <- subset(df, Time == 18)
TrainSet4 <- subset(df, Time != 19)
TestSet4 <- subset(df, Time == 19)
TrainSet5 <- subset(df, Time != 20)
TestSet5 <- subset(df, Time == 20)
The folds should, however, be in the format that the caret::groupKFold function outputs, so that they can be fed into the caret::train function:
CVFolds <- caret::groupKFold(df$Time)
CVFolds
Thanks in advance!
Recommended answer
For customized folds, I find that the built-in functions are usually not flexible enough, so I usually build them with tidyverse. One approach to your problem would be:
library(tidyverse)
df %>%
mutate(id = row_number()) %>% #use the row number as a column called id
filter(Time > 15) %>% #filter Time as per your need
split(.$Time) %>% #split df to a list by Time
map(~ .x %>% select(id)) #select row numbers for each list element
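For readers who prefer to avoid the tidyverse dependency, the same idea can be sketched in base R. This is only an illustration, not part of the original answer: the helper name holdout_rows and its cutoff argument are hypothetical, and it returns plain integer vectors rather than one-column data frames.

```r
# Hypothetical helper: held-out row numbers for every time point above `cutoff`
holdout_rows <- function(time, cutoff) {
  rows <- which(time > cutoff)   # row numbers that are eligible as test points
  split(rows, time[rows])        # one integer vector per distinct time point
}

# Same data frame as in the question
df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = 1:20)

folds <- holdout_rows(df$Time, cutoff = 15)
str(folds)
```

Each list element is named after its time point and holds the row numbers to be held out in that fold.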
Example with two rows per time point:
df <- data.frame(Effect = seq(from = 0.025, to = 1, by = 0.025),
Time = rep(1:20, each = 2))
df %>%
mutate(id = row_number()) %>%
filter(Time > 15) %>%
split(.$Time) %>%
map(~ .x %>% select(id)) -> test_folds
test_folds
#output
$`16`
id
1 31
2 32
$`17`
id
3 33
4 34
$`18`
id
5 35
6 36
$`19`
id
7 37
8 38
$`20`
id
9 39
10 40
With an unequal number of rows per time point:
df <- data.frame(Effect = seq(from = 0.55, to = 1, by = 0.05),
                 Time = c(rep(1, 5), rep(2, 3), rep(3, 2)))
df %>%
mutate(id = row_number()) %>%
filter(Time > 1) %>%
split(.$Time) %>%
map(~ .x %>% select(id))
#output
$`2`
id
1 6
2 7
3 8
$`3`
id
4 9
5 10
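Now you can define these hold-out folds inside trainControl with the argument indexOut. Note that indexOut expects a list of integer vectors, of the same length as the list given to index, which holds the training rows for each resample.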
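A minimal sketch of how the folds might be passed to trainControl follows. It is not taken from the original answer. The object names held_out, train_rows, and ctrl, the fold names, and the placeholder model method = "lm" are all assumptions made for illustration. The training rows of each fold are simply every row not held out in that fold, which mirrors the TrainSet/TestSet pairs in the question.

```r
# Illustrative data: two rows per time point
df <- data.frame(Effect = seq(from = 0.025, to = 1, by = 0.025),
                 Time = rep(1:20, each = 2))

# Held-out rows: one integer vector per time point with Time > 15
test_rows <- which(df$Time > 15)
held_out <- unname(split(test_rows, df$Time[test_rows]))

# Training rows: every row that is not held out in the same fold
train_rows <- lapply(held_out, function(out) setdiff(seq_len(nrow(df)), out))

# Give both lists matching fold names (the naming scheme is an assumption)
names(held_out) <- names(train_rows) <- paste0("Fold", seq_along(held_out))

# Only run the caret part if the package is installed
if (requireNamespace("caret", quietly = TRUE)) {
  ctrl <- caret::trainControl(method = "cv",
                              index = train_rows,
                              indexOut = held_out)
  # "lm" is only a placeholder model for this sketch
  fit <- caret::train(Effect ~ Time, data = df, method = "lm",
                      trControl = ctrl)
  print(fit$resample)
}
```

With a single row per time point, as in the question's data frame, each fold holds out one observation, so a per-fold R-squared cannot be computed and RMSE or MAE is the more meaningful metric.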
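EDIT: to get output in a format similar to that of caret::groupKFold (a plain list of integer vectors), one can do the following. Note that groupKFold returns the training rows of each fold, whereas this list contains the held-out rows: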
df %>%
mutate(id = row_number()) %>%
filter(Time > 1) %>%
split(.$Time) %>%
map(~ .x %>%
select(id) %>%
unlist %>%
unname) %>%
unname