Cross validation for glm() models

Question
I'm trying to do a 10-fold cross validation for some glm models that I have built earlier in R. I'm a little confused about the cv.glm() function in the boot package, although I've read a lot of help files. When I provide the following formula:
library(boot)
cv.glm(data, glmfit, K=10)
Does the "data" argument here refer to the whole dataset or only to the test set?
The examples I have seen so far provide the "data" argument as the test set, but that does not really make sense: why run 10 folds on the same test set? They would all give exactly the same result (I assume!).
Unfortunately, ?cv.glm only explains it as:
data: A matrix or data frame containing the data. The rows should be cases and the columns correspond to variables, one of which is the response
My other question would be about the $delta[1]
result. Is this the average prediction error over the 10 trials? What if I want to get the error for each fold?
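For reference, a minimal sketch of how cv.glm() is typically called on the full dataset rather than a held-out test set (assuming a data frame data with a 0/1 response groupcol and predictors var1–var3, the names used in the script below):

```r
library(boot)

# Fit the model on the FULL dataset; cv.glm() handles the
# train/test splitting internally for each of the K folds.
model <- glm(groupcol ~ var1 + var2 + var3,
             family = "binomial", data = data)

# Cost function for a binary response (misclassification rate);
# without it, cv.glm() defaults to average squared error.
cost <- function(y, pred) mean(abs(y - pred) > 0.5)

cv.res <- cv.glm(data, model, cost, K = 10)
cv.res$delta[1]  # cross-validated estimate of prediction error
```

Note that cv.glm() only returns the aggregated delta estimates; it does not expose the error for each individual fold.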
Here is what my script looks like:
##data partitioning
sub <- sample(nrow(data), floor(nrow(data) * 0.9))
training <- data[sub, ]
testing <- data[-sub, ]
##model building
model <- glm(formula = groupcol ~ var1 + var2 + var3,
family = "binomial", data = training)
##cross-validation
cv.glm(testing, model, K=10)
Answer
I am always a little cautious about using various packages' built-in 10-fold cross-validation methods. I have my own simple script to create the test and training partitions manually for any machine learning package:
#Randomly shuffle the data
yourData <- yourData[sample(nrow(yourData)), ]
#Create 10 equally sized folds
folds <- cut(seq(1, nrow(yourData)), breaks=10, labels=FALSE)
#Perform 10-fold cross validation
for(i in 1:10){
  #Segment your data by fold using the which() function
  testIndexes <- which(folds==i, arr.ind=TRUE)
  testData <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]
  #Use the test and train data partitions however you desire...
}
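This structure also answers the per-fold error question: each iteration has its own train/test split, so the error can be recorded per fold. A sketch filling in the loop body, assuming the same binomial glm as in the question and that groupcol is coded 0/1 (these names come from the question, not from any package):

```r
foldErrors <- numeric(10)
for(i in 1:10){
  testIndexes <- which(folds == i)
  testData  <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]
  # Refit the model on the training portion of this fold
  fit <- glm(groupcol ~ var1 + var2 + var3,
             family = "binomial", data = trainData)
  # Predicted probabilities on the held-out fold
  pred <- predict(fit, newdata = testData, type = "response")
  # Misclassification rate for this fold (0.5 threshold)
  foldErrors[i] <- mean((pred > 0.5) != (testData$groupcol == 1))
}
foldErrors        # error for each individual fold
mean(foldErrors)  # aggregate, comparable to cv.glm()$delta[1]
```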