Why can't I use cv.glm on the output of bestglm?
Question
I am trying to do best subset selection on the wine dataset, and then I want to get the test error rate using 10-fold CV. The code I used is:
library(bestglm) # provides bestglm()
library(boot)    # provides cv.glm()
# misclassification rate at a 0.5 probability threshold
cost1 <- function(good, pi = 0) mean(abs(good - pi) > 0.5)
res.best.logistic <-
  bestglm(Xy = winedata,
          family = binomial,     # binomial family for logistic regression
          IC = "AIC",            # information criterion
          method = "exhaustive")
res.best.logistic$BestModels
best.cv.err <- cv.glm(winedata, res.best.logistic$BestModel, cost1, K = 10)
However, this gives the error:
Error in UseMethod("family") : no applicable method for 'family' applied to an object of class "NULL"
I thought that $BestModel is the lm object that represents the best fit, and that's what the manual says too. If that's the case, why can't I get the test error for it using 10-fold CV with cv.glm?
The dataset used is the white wine dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality. cv.glm comes from the boot package, and bestglm from the bestglm package.
The data was processed as follows:
winedata <- read.delim("winequality-white.csv", sep = ';')
# recode quality in one step (1 if quality >= 7, else 0) and convert to a factor,
# so the numeric comparison never runs on a column already coerced to character
winedata$quality <- factor(ifelse(winedata$quality >= 7, "1", "0"))
names(winedata)[names(winedata) == "quality"] <- "good" # rename 'quality' to 'good'
Recommended answer
bestglm rearranges your data and renames the response variable to y. So when you pass the fitted model back into cv.glm, winedata has no column y, and everything fails from there.
It's always good to check the class first:
class(res.best.logistic$BestModel)
[1] "glm" "lm"
But look at the call stored in res.best.logistic$BestModel and at the data it was fitted on:
res.best.logistic$BestModel$call
glm(formula = y ~ ., family = family, data = Xi, weights = weights)
head(res.best.logistic$BestModel$model)
y fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1 0 7.0 0.27 0.36 20.7 0.045
2 0 6.3 0.30 0.34 1.6 0.049
3 0 8.1 0.28 0.40 6.9 0.050
4 0 7.2 0.23 0.32 8.5 0.058
5 0 7.2 0.23 0.32 8.5 0.058
6 0 8.1 0.28 0.40 6.9 0.050
free.sulfur.dioxide density pH sulphates
1 45 1.0010 3.00 0.45
2 14 0.9940 3.30 0.49
3 30 0.9951 3.26 0.44
4 47 0.9956 3.19 0.40
5 47 0.9956 3.19 0.40
6 30 0.9951 3.26 0.44
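The mechanism can be reproduced without bestglm. The following is a minimal sketch using base R only; fit_inside, toy, x1 and x2 are made-up names for illustration, and the helper merely mimics a wrapper that fits on an internal, renamed data frame (it is not bestglm's actual code). The stored call mentions Xi and y, which do not exist outside the helper, so re-evaluating the call fails, while a model fitted directly on the original data can be re-evaluated.

```r
# Hypothetical wrapper: copies the data, renames the response to y,
# and fits on the internal data frame Xi
fit_inside <- function(df, response) {
  Xi <- df[, setdiff(names(df), response), drop = FALSE]
  Xi$y <- df[[response]]
  glm(y ~ ., family = binomial, data = Xi)
}

set.seed(1)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$good <- rbinom(50, 1, plogis(toy$x1))

m <- fit_inside(toy, "good")
print(m$call)                 # the call refers to Xi and y, not toy and good

# Re-evaluating the stored call fails: Xi only existed inside fit_inside()
refit <- try(eval(m$call), silent = TRUE)
print(inherits(refit, "try-error"))

# A model fitted directly on toy stores a call that can be re-evaluated
m2 <- glm(good ~ ., family = binomial, data = toy)
refit2 <- eval(m2$call)
print(class(refit2))
```

Functions that refit a model from its stored call on new subsets of the data run into the same kind of problem, which is why refitting on the original data frame is the simpler route.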
You could substitute things in the call, but that gets messy. Fitting is cheap, so refit the selected model on winedata and pass that fit to cv.glm:
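# BestModels has one logical column per predictor plus a final Criterion column;
# drop the Criterion column and take the first (best) row
best_row <- unlist(res.best.logistic$BestModels[1, -ncol(winedata)])
# names of the predictors selected in the best model
best_var <- names(best_row)[best_row]
new_form <- as.formula(paste("good ~", paste(best_var, collapse = " + ")))
# refit on the original data so the stored call refers to winedata and good
fit <- glm(new_form, data = winedata, family = "binomial")
best.cv.err <- cv.glm(winedata, fit, cost1, K = 10)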
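To read the result, look at the delta component of the object returned by cv.glm: its first element is the raw cross-validation estimate of prediction error and the second is a bias-adjusted version. The sketch below runs the same refit-then-cross-validate pattern on simulated data, assuming the boot package is installed; toy, x1 and x2 are made-up names, and the numbers it produces say nothing about the wine data.

```r
library(boot)  # provides cv.glm()

set.seed(42)
# simulated stand-in for winedata: two predictors and a binary factor response
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$good <- factor(rbinom(200, 1, plogis(1.5 * toy$x1)))

# misclassification rate at a 0.5 probability threshold (same as in the question)
cost1 <- function(good, pi = 0) mean(abs(good - pi) > 0.5)

# fit directly on the data frame that is also passed to cv.glm
fit <- glm(good ~ x1, data = toy, family = binomial)
cv <- cv.glm(toy, fit, cost1, K = 10)

print(cv$delta)  # raw and adjusted 10-fold CV estimates of the error rate
```

With the wine data, best.cv.err$delta[1] would be the 10-fold CV misclassification rate the question asks for.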