Why can't I use cv.glm on the output of bestglm?

Problem description

I am trying to do best subset selection on the wine dataset, and then I want to estimate the test error rate using 10-fold CV. The code I used is:

library(bestglm)  # best subset selection
library(boot)     # provides cv.glm

# misclassification rate at a 0.5 probability threshold
cost1 <- function(good, pi = 0) mean(abs(good - pi) > 0.5)
res.best.logistic <-
    bestglm(Xy = winedata,
            family = binomial,          # binomial family for logistic
            IC = "AIC",                 # Information criteria
            method = "exhaustive")
res.best.logistic$BestModels
best.cv.err <- cv.glm(winedata, res.best.logistic$BestModel, cost1, K = 10)

However, this gives the error:

Error in UseMethod("family") : no applicable method for 'family' applied to an object of class "NULL"

I thought that $BestModel is the lm object that represents the best fit, and that is also what the manual says. If that's the case, why can't I estimate the test error on it with 10-fold CV, using cv.glm?

The dataset used is the white wine dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality, and the packages used are the boot package (for cv.glm) and the bestglm package.

The data was processed as follows:

winedata <- read.delim("winequality-white.csv", sep = ';')
winedata$quality[winedata$quality< 7] <- "0" #recode
winedata$quality[winedata$quality>=7] <- "1" #recode
winedata$quality <- factor(winedata$quality)# Convert the column to a factor
names(winedata)[names(winedata) == "quality"] <- "good"      #rename 'quality' to 'good'
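
A possible one-step version of this recoding, shown on a small made-up data frame (the toy values and the alcohol column are assumptions for illustration, not rows from the wine file). In the code above, the first assignment turns quality into a character column, so the second comparison is done on strings; that happens to work for single-digit scores, but building the factor directly from the numeric comparison avoids relying on it. It also leaves the response as the last column, which is where bestglm expects it in Xy.

```r
# Toy stand-in for the wine data (hypothetical values)
toy <- data.frame(alcohol = c(9.5, 11.2, 12.8, 10.1),
                  quality = c(5L, 7L, 8L, 6L))

# Recode in one step: 1 if quality >= 7, else 0, stored as a factor
toy$good <- factor(as.integer(toy$quality >= 7), levels = c(0, 1))

# Drop the original score so that the response 'good' is the last column
toy$quality <- NULL

str(toy)
```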

Recommended answer

bestglm rearranges your data and renames your response variable to y. If you pass the fit back into cv.glm, winedata does not have a column y, and everything breaks after that.

It's always good to check the class:

class(res.best.logistic$BestModel)
[1] "glm" "lm" 

But if you look at the call of res.best.logistic$BestModel, and at the model frame it was fitted on:

res.best.logistic$BestModel$call

glm(formula = y ~ ., family = family, data = Xi, weights = weights)

head(res.best.logistic$BestModel$model)
  y fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1 0           7.0             0.27        0.36           20.7     0.045
2 0           6.3             0.30        0.34            1.6     0.049
3 0           8.1             0.28        0.40            6.9     0.050
4 0           7.2             0.23        0.32            8.5     0.058
5 0           7.2             0.23        0.32            8.5     0.058
6 0           8.1             0.28        0.40            6.9     0.050
  free.sulfur.dioxide density   pH sulphates
1                  45  1.0010 3.00      0.45
2                  14  0.9940 3.30      0.49
3                  30  0.9951 3.26      0.44
4                  47  0.9956 3.19      0.40
5                  47  0.9956 3.19      0.40
6                  30  0.9951 3.26      0.44
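
Note that the stored call also refers to family, Xi and weights, which are local variables inside bestglm. cv.glm re-evaluates that call for each fold with only data swapped out, so family is presumably looked up as the stats::family function and called with no argument, which would explain why the error mentions family rather than y. A minimal sketch of the same situation without bestglm follows; fit_inside and the toy data are hypothetical stand-ins, not bestglm's actual internals.

```r
library(boot)  # provides cv.glm

set.seed(1)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200),
                  good = rbinom(200, 1, 0.4))

# Hypothetical stand-in for a wrapper that fits a glm internally on a
# renamed copy of the data, using its own local variable names.
fit_inside <- function(Xy, family) {
  Xi <- Xy
  names(Xi)[names(Xi) == "good"] <- "y"
  glm(y ~ ., family = family, data = Xi)
}

inner_fit <- fit_inside(toy, binomial)

# The stored call mentions 'family' and 'Xi', which only existed
# inside fit_inside(), and a response 'y' that 'toy' does not have.
print(inner_fit$call)

# cv.glm re-evaluates that call on subsets of 'toy', so it fails.
res <- try(cv.glm(toy, inner_fit, K = 5), silent = TRUE)
inherits(res, "try-error")
```

Refitting the chosen model directly on the original data frame sidesteps both problems, because the stored call then only refers to objects that cv.glm can find.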

You could substitute things in the call, but that gets messy. Fitting is not costly, so refit the best model on winedata and pass that fit to cv.glm:

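# BestModels has one logical column per predictor plus a final Criterion
# column; its first row is the best model
best_row = unlist(res.best.logistic$BestModels[1, -ncol(winedata)])
# take the variable names for best model
best_var = names(best_row)[best_row]
new_form = as.formula(paste("good ~", paste(best_var, collapse = "+")))
# refit on winedata so the stored call refers to columns that exist there
fit = glm(new_form, data = winedata, family = "binomial")

best.cv.err <- cv.glm(winedata, fit, cost1, K = 10)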
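
To read off the error rate, look at the delta component of the object cv.glm returns: its first element is the raw cross-validation estimate under the supplied cost function, and the second is a bias-adjusted version. A self-contained sketch of the refit-then-cross-validate pattern on simulated data (the toy data frame and the chosen variables in best_var are assumptions for illustration, not results from the wine data):

```r
library(boot)  # provides cv.glm

set.seed(42)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$good <- factor(rbinom(n, 1, plogis(1.5 * toy$x1 - toy$x2)))

# Same cost as in the question: misclassification rate at a 0.5 threshold
cost1 <- function(good, pi = 0) mean(abs(good - pi) > 0.5)

# Hypothetical "best" subset, as if it had come out of a selection step
best_var <- c("x1", "x2")
new_form <- reformulate(best_var, response = "good")

# Fit on the same data frame that is later passed to cv.glm
fit <- glm(new_form, data = toy, family = binomial)

cv_out <- cv.glm(toy, fit, cost1, K = 10)
cv_out$delta[1]  # raw 10-fold CV estimate of the misclassification rate
```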
