Preparing Test/Train sets for Cross Validation in a loop

Published: 2022/9/6 11:48:55. Tags: r, for-loop, cross-validation


Problem description

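I am trying to build test and training sets for cross-validation. I have 95 individual IDs in total and am trying to accomplish the task as follows: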

# create 95 unique IDs as individuals
set.seed(1)
indv <- stringi::stri_rand_strings(95, 4)

# specify Kfold
n.folds <- 5

folds <- cut(1:length(indv), breaks = n.folds, labels = FALSE)
# randomise the folds
folds <- sample(folds, length(folds)) 

samples.train <- list()
samples.test <- list()
foldSet <- list()

kfold.df <- data.frame("IID" = indv)

for (f in 1:n.folds) {
  samples.train[[f]] <- indv[folds != f]
  samples.test[[f]] <- indv[folds == f]

  # label an ID "test" if it is in this fold's test sample, and "train" otherwise
  foldSet[[f]] <- ifelse(kfold.df$IID %in% samples.test[[f]], "test", "train")

  # combine foldSet with the data frame
  kfold.df[[f]] <- cbind(kfold.df, foldSet[[f]])
}

The goal is to prepare 5 test and training sample sets for modelling. But I run into this error message:

Error in data.frame(..., check.names = FALSE) : 
arguments imply differing number of rows: 95, 2

In addition, although samples.train and samples.test are correct, the output of foldSet is not what I expected. Can you help me get this loop working?

Update: here is the for loop without the wildcard (the [[f]] index) when creating foldSet:

for (f in 1:n.folds) {
  samples.train[[f]] <- indv[folds != f]
  samples.test[[f]] <- indv[folds == f]

  foldSet <<- ifelse(kfold.df$IID %in% samples.test[[f]], "test", "train")
  # combine foldSet with the data frame
  kfold.df <<- cbind(kfold.df, foldSet)
}

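After running the loop, you will find that kfold.df lists the random test/train sets of all five folds as a data frame. For each iteration, I want to create the test and training sets corresponding to f, so that after five iterations I have access to each fold's train/test set for the next operations in the loop, such as kfold.df[foldSet == "train", "IID"]. I need this access because I want to use it to subset another, larger matrix based on the training and test indv of each fold, in preparation for fitting a regression model. This is why I used the wildcard for foldSet, so that the loop could create the sets itself, but I could not get it to work.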

Recommended answer

Part 1

If I understand correctly, this is roughly what you are looking for (minus the strings). I also show how you would use it with real data.

library(tidyverse)

giveMe <- function(rowCt, nfolds) {
  # set.seed(235) # seed removed once the function worked, to get the
  #  expected randomness

  folds <- cut(1:rowCt, breaks = nfolds, labels = FALSE)
  # randomise the folds
  folds <- sample(folds, length(folds))
  # create the folds' sets: TRUE marks a training row, FALSE a test row
  kfold.df <- map_dfc(1:nfolds, ~ folds != .x) %>%
    setNames(paste0("foldSet_", 1:nfolds)) %>% # name each field
    add_column(IID = 1:rowCt, .before = 1)     # add indices to the left

  return(kfold.df) # return a data frame
}

given <- giveMe(95, 5)

giveMore <- giveMe(nrow(iris), 5) # uses the built-in iris data set
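
To connect this back to the string IDs and the "larger matrix" in the question, here is a minimal base-R sketch. It is not part of the original answer. The IDs, the matrix `big_mat` and its column names are made up for illustration, and the sketch assumes the larger matrix has one row per individual with the IDs as row names.

```r
# Sketch only: `indv` and `big_mat` are hypothetical stand-ins for the real data.
set.seed(1)
n_folds <- 5
indv <- sprintf("ID%03d", 1:95)  # 95 made-up string IDs

# Assign each ID to one of the folds at random
folds <- sample(cut(seq_along(indv), breaks = n_folds, labels = FALSE))

# One column per fold: "test" for the held-out IDs, "train" for the rest
fold_cols <- lapply(seq_len(n_folds),
                    function(f) ifelse(folds == f, "test", "train"))
names(fold_cols) <- paste0("foldSet_", seq_len(n_folds))
kfold.df <- data.frame(IID = indv, fold_cols, stringsAsFactors = FALSE)

# Hypothetical larger matrix, one row per individual, rows named by ID
big_mat <- matrix(rnorm(length(indv) * 3), nrow = length(indv),
                  dimnames = list(indv, c("x1", "x2", "x3")))

for (f in seq_len(n_folds)) {
  fold_col  <- paste0("foldSet_", f)
  train_ids <- kfold.df$IID[kfold.df[[fold_col]] == "train"]
  test_ids  <- kfold.df$IID[kfold.df[[fold_col]] == "test"]
  # Subset the larger matrix by row name for this fold
  train_mat <- big_mat[train_ids, , drop = FALSE]
  test_mat  <- big_mat[test_ids, , drop = FALSE]
  # ... fit and evaluate the model for fold f here ...
  cat("fold", f, ":", nrow(train_mat), "train rows,",
      nrow(test_mat), "test rows\n")
}
```

Selecting the column by name with `kfold.df[[fold_col]]` is what replaces the "wildcard" from the question: the fold number is pasted into the column name instead of into a variable name.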

Part 2

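You can simply create your random fold sequence and use it directly with the model; you do not need to stack the folds in a data frame. You have to loop over the model the same number of times anyway, so why not do both at once?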

folds <- sample(cut(1:nrow(iris), 5, # no seed: random on purpose
                    labels = FALSE))

tellMe <- map(1:5, # one model per fold
              ~ lm(Sepal.Length ~ .,
                   iris[folds != .x, # train on every fold except .x
                        1:4]))       # columns 1:4 only; 'Species' is dropped to avoid the groups issue

Check the model performance:

map_dfr(1:5, .f = function(x){
  y = tellMe[[x]]
  sigma = sigma(y)
  rsq = summary(y)$adj.r.squared
  c(sigma = sigma, rsq = rsq)
})
# # A tibble: 5 × 2
#   sigma   rsq
#   <dbl> <dbl>
# 1 0.334 0.844
# 2 0.309 0.869
# 3 0.302 0.846
# 4 0.330 0.847
# 5 0.295 0.872

Predict and check performance on the test folds:

# create a list of the predicted values from the test data
showMe <- map(1:5,
              ~ predict(tellMe[[.x]],
                        iris[folds == .x, 1:4]))

# Grab comparable metrics like those from the models
map_dfr(1:5,
        .f = function(x) {
          A <- iris[folds == x, ]$Sepal.Length       # actual values
          P <- showMe[[x]]                           # predicted values
          sigma <- sqrt(sum((A - P)^2) / length(A))  # RMSE on the test fold
          rsq <- cor(A, P)^2
          c(sigma = sigma, rsq = rsq)
        })
# # A tibble: 5 × 2
#   sigma   rsq
#   <dbl> <dbl>
# 1 0.232 0.919
# 2 0.342 0.774
# 3 0.366 0.884
# 4 0.250 0.906
# 5 0.384 0.790
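
The fit and predict steps above can also be folded into a single pass over the folds. The following is a base-R sketch under stated assumptions, not the answer's own code: the seed is added only so the split is repeatable, and the names `dat` and `cv_results` are made up for illustration.

```r
# Sketch: fit on the training folds and score the held-out fold in one pass.
set.seed(1)  # assumption: seeded only so the split can be reproduced
n_folds <- 5
dat <- iris[, 1:4]  # numeric columns only, as in the code above
folds <- sample(cut(seq_len(nrow(dat)), breaks = n_folds, labels = FALSE))

cv_results <- do.call(rbind, lapply(seq_len(n_folds), function(f) {
  train <- dat[folds != f, ]
  test  <- dat[folds == f, ]
  fit   <- lm(Sepal.Length ~ ., data = train)
  pred  <- predict(fit, newdata = test)
  data.frame(fold   = f,
             n_test = nrow(test),
             rmse   = sqrt(mean((test$Sepal.Length - pred)^2)),
             rsq    = cor(test$Sepal.Length, pred)^2)
}))

cv_results
colMeans(cv_results[, c("rmse", "rsq")])  # average over the five folds
```

Returning one small data frame per fold and binding them at the end keeps the per-fold metrics together without storing the models or the fold membership in separate lists.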

Part 3

Here I will use the caret library. However, there are many other options.

library(caret)

set.seed(1)
# split training and testing 70/30%
tr <- createDataPartition(iris$Species, p = .7, list = F)

# set up 5-fold val
trC <- trainControl(method = "cv", number = 5)

# train the model
fit <- train(Sepal.Length~., iris[tr, ], 
             method = "lm", 
             trControl = trC)
summary(fit)
# truncated results best model:
# Residual standard error: 0.2754 on 39 degrees of freedom
# Multiple R-squared:  0.9062,  Adjusted R-squared:  0.8941 

fit.p <- predict(fit, iris[-tr,])
postResample(fit.p, iris[-tr, ]$Sepal.Length)
#      RMSE  Rsquared       MAE 
# 0.2795920 0.8925574 0.2302402

If you want to see the performance of each fold, you can do that too.

fit$resample
#        RMSE  Rsquared       MAE Resample
# 1 0.3629901 0.7911634 0.2822708    Fold1
# 2 0.3680954 0.8888947 0.2960464    Fold2
# 3 0.3508317 0.8394489 0.2709989    Fold3
# 4 0.2548549 0.8954633 0.1960375    Fold4
# 5 0.3396910 0.8661239 0.3187768    Fold5
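
If explicit fold membership is needed, as in the question, caret can also generate it with `createFolds()`. This is a sketch and not part of the original answer; the folds it produces are not necessarily the ones `train()` used internally for the fit above.

```r
library(caret)

set.seed(1)
# With returnTrain = FALSE, each list element holds the held-out (test)
# row indices of one fold
test_idx <- createFolds(iris$Sepal.Length, k = 5,
                        list = TRUE, returnTrain = FALSE)

for (nm in names(test_idx)) {
  test_rows  <- test_idx[[nm]]
  train_rows <- setdiff(seq_len(nrow(iris)), test_rows)
  # ... subset the data with train_rows / test_rows and fit the model ...
  cat(nm, ":", length(train_rows), "train rows,",
      length(test_rows), "test rows\n")
}
```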
