Creating folds manually for K-fold cross-validation in R

Question

I am trying to build a K-fold CV regression model with K = 5. I tried the cv.glm function from the "boot" package, but my PC ran out of memory because the boot package always computes a LOOCV MSE alongside it. So I decided to do it manually, but I ran into the following problem. I am trying to divide my data frame into 5 vectors of equal length, each containing a sample of 1/5 of the row numbers of my df, but I get unexplainable lengths from the 3rd fold onward.

# example data: three columns of 100 values drawn without replacement from 1:1000
a <- sample(1:1000, size = 100, replace = FALSE)
b <- sample(1:1000, size = 100, replace = FALSE)
c <- sample(1:1000, size = 100, replace = FALSE)
df <- data.frame(a,b,c)
head(df)

# create first fold (correct: n=20)
set.seed(5)
K1row <- sample(x = nrow(df), size = (nrow(df)/5), replace = FALSE, prob = NULL)
str(K1row) # int [1:20] 21 68 90 28 11 67 50 76 88 96 ...

# create second fold (still going strong: n=20)
set.seed(5)
K2row <- sample(x = nrow(df[-K1row,]), size = ((nrow(df[-K1row,]))/4), replace = FALSE, prob = NULL)
str(K2row) # int [1:20] 17 55 72 22 8 53 40 59 69 76 ...

# create third fold (this is where it goes wrong: n=21)
set.seed(5)
K3row <- sample(x = nrow(df[-c(K1row,K2row),]), size = ((nrow(df[-c(K1row,K2row),]))/3), replace = FALSE, prob = NULL)
str(K3row) # int [1:21] 13 44 57 18 7 42 31 47 54 60 ...

# create fourth fold (and it gets worse: n=26)
set.seed(5)
K4row <- sample(x = nrow(df[-c(K1row,K2row,K3row),]), size = ((nrow(df[-c(K1row,K2row,K3row),]))/2), replace = FALSE, prob = NULL)
str(K4row) # int [1:26] 11 35 46 14 6 33 25 37 43 5 ...

The vector length seems to increase from the 3rd fold onward. Can anyone explain what I'm doing wrong? My code (and reasoning) seems logical, but the outcome says otherwise. Many thanks in advance!

Recommended answer

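It's because K1row and K2row have some elements in common. K2row is drawn from 1:nrow(df[-K1row, ]), so its values are positions within the reduced data frame, not row numbers of df, and some of them coincide with values in K1row. As a result, df[-c(K1row, K2row), ] drops fewer than 40 distinct rows, more than 60 rows remain, and a third of that is more than 20. You are effectively sampling with replacement. The method below uses modulo to split the rows evenly.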
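
A quick way to confirm the overlap is to count how many distinct rows the first two "folds" actually remove. The sketch below assumes a 100-row data frame shaped like the one in the question; the helper names overlap, removed and remaining are illustrative only, and the exact counts depend on the R version's random number generator, so none are quoted here.

```r
# Illustrative data with the same shape as in the question (100 rows)
df <- data.frame(a = sample(1000, 100),
                 b = sample(1000, 100),
                 c = sample(1000, 100))

set.seed(5)
K1row <- sample(nrow(df), size = nrow(df) / 5)

set.seed(5)
# These are positions within df[-K1row, ] (values 1..80), not row numbers of df
K2row <- sample(nrow(df[-K1row, ]), size = nrow(df[-K1row, ]) / 4)

overlap   <- intersect(K1row, K2row)          # values that appear in both vectors
removed   <- length(unique(c(K1row, K2row)))  # distinct rows dropped by the negative index
remaining <- nrow(df[-c(K1row, K2row), ])     # rows left over for the third fold

length(overlap)
removed
remaining
```

Whenever length(overlap) is non-zero, removed is below 40 and remaining is above 60, which is why the third fold comes out larger than 20. The modulo-based split that follows avoids this by drawing a single permutation of the row numbers.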

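set.seed(5)
rand <- sample(nrow(df))  # random permutation of the row numbers

# Assign folds by position in the shuffled vector (not by the row number itself),
# so that fold membership is random and every row lands in exactly one fold
pos <- seq_along(rand)
K1row <- rand[pos %% 5 + 1 == 1]
K2row <- rand[pos %% 5 + 1 == 2]
K3row <- rand[pos %% 5 + 1 == 3]
K4row <- rand[pos %% 5 + 1 == 4]
K5row <- rand[pos %% 5 + 1 == 5]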
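
Once the folds are disjoint, they can be used for the cross-validation itself. The following is a minimal sketch rather than the original poster's model: it assumes a 100-row data frame like the one in the question, stores the fold assignment as a single label vector (fold_id, a name introduced here) instead of five separate vectors, and fits a hypothetical regression a ~ b + c chosen purely for illustration.

```r
set.seed(5)
# Illustrative data; column names a, b, c follow the question
df <- data.frame(a = sample(1000, 100),
                 b = sample(1000, 100),
                 c = sample(1000, 100))

K <- 5
# Give each row a fold label 1..K in equal proportions, then shuffle the labels
fold_id <- sample(rep(seq_len(K), length.out = nrow(df)))

# Folds are disjoint by construction; this shows how many rows each one holds
table(fold_id)

# Hypothetical model a ~ b + c, used only to show the loop structure
mse <- numeric(K)
for (k in seq_len(K)) {
  test_rows <- which(fold_id == k)
  fit  <- lm(a ~ b + c, data = df[-test_rows, ])    # train on the other K - 1 folds
  pred <- predict(fit, newdata = df[test_rows, ])   # predict the held-out fold
  mse[k] <- mean((df$a[test_rows] - pred)^2)
}

mean(mse)  # K-fold CV estimate of the test MSE
```

Keeping one label per row makes it hard to reuse a row by accident, and only K models are fitted, so no leave-one-out pass is involved.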
