Cross validation for glm() models


Question

I'm trying to do a 10-fold cross validation for some glm models that I have built earlier in R. I'm a little confused about the cv.glm() function in the boot package, although I've read a lot of help files. When I provide the following formula:

library(boot)
cv.glm(data, glmfit, K=10)

Does the "data" argument here refer to the whole dataset or only to the test set?

The examples I have seen so far provide the "data" argument as the test set, but that does not really make sense: why do 10 folds on the same test set? They would all give exactly the same result (I assume!).

Unfortunately, ?cv.glm only explains:


    data: A matrix or data frame containing the data. The rows should be
    cases and the columns correspond to variables, one of which is the
    response

My other question is about the $delta[1] result. Is this the average prediction error over the 10 folds? What if I want to get the error for each fold?
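For reference, a minimal sketch (on hypothetical data, not the asker's) of how cv.glm() is typically called: it receives the full dataset and builds the K folds internally, and $delta holds the aggregated error, not per-fold values:

```r
library(boot)

# Hypothetical data with a binary response
set.seed(1)
df <- data.frame(y  = rbinom(100, 1, 0.5),
                 x1 = rnorm(100),
                 x2 = rnorm(100))

fit <- glm(y ~ x1 + x2, family = binomial, data = df)

# Pass the WHOLE data frame; cv.glm() partitions it into K folds itself
cv <- cv.glm(df, fit, K = 10)

cv$delta[1]  # raw cross-validated prediction error (averaged over folds)
cv$delta[2]  # bias-adjusted estimate
```

Note that the default cost function is the average squared error; for a binary classifier you can pass a custom cost (e.g. misclassification rate) as the third argument.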

Here is what my script looks like:

##data partitioning
sub <- sample(nrow(data), floor(nrow(data) * 0.9))
training <- data[sub, ]
testing <- data[-sub, ]

##model building
model <- glm(formula = groupcol ~ var1 + var2 + var3,
        family = "binomial", data = training)

##cross-validation
cv.glm(testing, model, K=10)


Answer

I am always a little cautious about using the 10-fold cross-validation methods of various packages. I have my own simple script that creates the test and training partitions manually for any machine-learning package:

#Randomly shuffle the data
yourData <- yourData[sample(nrow(yourData)), ]

#Create 10 equally sized folds
folds <- cut(seq(1, nrow(yourData)), breaks=10, labels=FALSE)

#Perform 10-fold cross validation
for(i in 1:10){
    #Segment your data by fold using the which() function
    testIndexes <- which(folds==i, arr.ind=TRUE)
    testData <- yourData[testIndexes, ]
    trainData <- yourData[-testIndexes, ]
    #Use the test and train data partitions however you desire...
}
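This loop also answers the per-fold question directly: fit the model inside the loop and record each fold's error yourself. A sketch, assuming the binary response and predictor names from the question (groupcol, var1..var3) and a 0.5 classification cutoff:

```r
foldErrors <- numeric(10)

for(i in 1:10){
    testIndexes <- which(folds == i)
    testData  <- yourData[testIndexes, ]
    trainData <- yourData[-testIndexes, ]

    # Fit on the training folds only
    fit <- glm(groupcol ~ var1 + var2 + var3,
               family = binomial, data = trainData)

    # Predicted probabilities on the held-out fold
    prob <- predict(fit, newdata = testData, type = "response")

    # Misclassification rate for this fold
    foldErrors[i] <- mean((prob > 0.5) != testData$groupcol)
}

foldErrors        # error for each individual fold
mean(foldErrors)  # overall cross-validated error
```

Unlike cv.glm()'s $delta, which reports only the aggregate, this keeps the full vector of fold errors so you can inspect their spread.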

