Specifying a selected range of data to be used in leave-one-out (jack-knife) cross-validation for use in the caret::train function


Question

This question builds on one I asked earlier: "Creating data partitions over a selected range of data to be fed into caret::train function for cross-validation".

The data I am working with looks like this:

df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5), Time = rep(c(1:20,1:20), each = 5), Replicate = c(1:5))

Essentially, I would like to create custom partitions like those generated by the caret::groupKFold function, but restricted to a specified range (i.e. Time > 15 days). Each fold should withhold one point from that range as the test set and use all other data for training. This is repeated until every point in the specified range has been used as a test set once. @Missuse wrote some code towards this end, in the question linked above, which gets close to the desired output.

I would try to show you the desired output, but in all honesty the caret::groupKFold function's output confuses me, so hopefully the above description will suffice. Happy to try and clarify though!

Recommended answer

Here is one way you can create the desired partitions using tidyverse:

library(tidyverse)
library(caret) #needed later for trainControl() and train()

df %>%
  mutate(id = row_number()) %>% #create a column called id which will hold the row numbers
  filter(Time > 15) %>% #subset data frame according to your description 
  split(.$id)  %>% #split the data frame into lists by id (row number)
  map(~ .x %>% select(id) %>% #clean up so it works with indexOut argument in trainControl
        unlist %>%
        unname) -> folds_cv

Edit: it seems the indexOut argument does not perform as expected, but the index argument does. So after making folds_cv, one can get the inverse (the rows to train on in each fold) using setdiff:

folds_cv <- lapply(folds_cv, function(x) setdiff(1:nrow(df), x))
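As a sanity check on the construction above, here is a minimal base-R sketch that builds the same kind of list directly, without tidyverse. The names held_out and folds_train are hypothetical and introduced only for illustration; the data frame is the one from the question.

```r
# Rebuild the example data from the question.
df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5),
                 Time = rep(c(1:20, 1:20), each = 5),
                 Replicate = c(1:5))

# Rows eligible to be held out: those with Time > 15.
held_out <- which(df$Time > 15)

# One fold per held-out row; each fold lists the rows used for TRAINING,
# which is the form the index argument of trainControl expects.
folds_train <- lapply(held_out, function(i) setdiff(seq_len(nrow(df)), i))
names(folds_train) <- held_out

length(folds_train)           # number of folds: 50
unique(lengths(folds_train))  # training rows per fold: 199
```

Each element leaves out exactly one row from the Time > 15 range, so a list built this way should be interchangeable with folds_cv after the setdiff step.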

Now:

test_control <- trainControl(index = folds_cv,
                             savePredictions = "final")


quad.lm2 <- train(Time ~ Effect,
                  data = df,
                  method = "lm",
                  trControl = test_control)

This gives a warning:

Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
> quad.lm2
Linear Regression 

200 samples
  1 predictor

No pre-processing
Resampling: Bootstrapped (50 reps) 
Summary of sample sizes: 199, 199, 199, 199, 199, 199, ... 
Resampling results:

  RMSE          Rsquared  MAE         
  3.552714e-16  NaN       3.552714e-16

Tuning parameter 'intercept' was held constant at a value of TRUE

So each resample used 199 rows for training and predicted on 1, repeating for all 50 rows that we wanted to hold out one at a time. This can be verified in:

quad.lm2$pred

Why Rsquared is missing I am not sure; I will dig a bit deeper.
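One plausible explanation, offered as an assumption rather than something checked against caret's source: with a single held-out point per resample, a correlation between observed and predicted values is undefined, so a per-fold R-squared cannot be computed. The sketch below illustrates this in base R and then computes RMSE and R-squared once over the pooled held-out predictions instead. It refits lm by hand rather than using train, and the names held_out and pooled are hypothetical.

```r
# Rebuild the example data from the question.
df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5),
                 Time = rep(c(1:20, 1:20), each = 5),
                 Replicate = c(1:5))

# A correlation computed from one observation is undefined.
single_point_cor <- suppressWarnings(cor(16, 16))
is.na(single_point_cor)

# Manual leave-one-out over the rows with Time > 15:
# fit on all other rows, predict the single held-out row.
held_out <- which(df$Time > 15)
pooled <- do.call(rbind, lapply(held_out, function(i) {
  fit <- lm(Time ~ Effect, data = df[-i, ])
  data.frame(rowIndex = i,
             obs = df$Time[i],
             pred = unname(predict(fit, newdata = df[i, , drop = FALSE])))
}))

# Metrics over all 50 pooled held-out predictions.
rmse <- sqrt(mean((pooled$obs - pooled$pred)^2))
r2 <- cor(pooled$obs, pooled$pred)^2
```

If this reading is right, the same pooled calculation could be applied to the obs and pred columns of quad.lm2$pred. The near-zero RMSE is expected for this toy data, since Time is an exact linear function of Effect.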
