AIC different between biglm and lm


Question

I have been trying to use biglm to run linear regressions on a large dataset (approx. 60,000,000 rows). I want to use AIC for model selection. However, while experimenting with biglm on smaller datasets, I discovered that the AIC values returned by biglm differ from those returned by lm. This even applies to the example in the biglm help.

data(trees)
ff<-log(Volume)~log(Girth)+log(Height)

chunk1<-trees[1:10,]
chunk2<-trees[11:20,]
chunk3<-trees[21:31,]

library(biglm)
a <- biglm(ff,chunk1)
a <- update(a,chunk2)
a <- update(a,chunk3)

AIC(a)  ## 48.18546

a_lm <- lm(ff, trees)
AIC(a_lm)  ## -62.71125

Can someone please explain what is happening here? Are the AICs generated by biglm safe to use for comparing biglm models fitted to the same dataset?

Recommended answer

tl;dr: it looks to me like there is a pretty obvious bug in the AIC method for biglm-class objects (more specifically, in the update method) in the current (0.9-1) version. The author of the biglm package is a smart, experienced guy, and biglm is widely used, so perhaps I'm missing something. Googling for "biglm AIC df.resid" suggests this was already discussed as far back as 2009. Update: the package author/maintainer reports via e-mail that this is indeed a bug.

Something funny seems to be going on here. The differences in AIC between models should be the same across modeling frameworks, whatever constants have been used and however parameters are counted (because the constants and the parameter counting should be consistent within a modeling framework).

The original example:

data(trees)
ff <- log(Volume)~log(Girth)+log(Height)
chunk1<-trees[1:10,]
chunk2<-trees[11:20,]
chunk3<-trees[21:31,]
library(biglm)
a <- biglm(ff,chunk1)
a <- update(a,chunk2)
a <- update(a,chunk3)
a_lm <- lm(ff, trees)

Now fit a reduced model:

ff2 <- log(Volume)~log(Girth)    
a2 <- biglm(ff2, chunk1)
a2 <- update(a2, chunk2)
a2 <- update(a2 ,chunk3)
a2_lm <- lm(ff2,trees)

Now compare the AIC values:

AIC(a)-AIC(a2)
## [1] 1.80222

AIC(a_lm)-AIC(a2_lm)
## [1] -20.50022

Check that we haven't screwed something up:

all.equal(coef(a),coef(a_lm))  ## TRUE
all.equal(coef(a2),coef(a2_lm))  ## TRUE

Look under the hood:

biglm:::AIC.biglm
## function (object, ..., k = 2) 
##    deviance(object) + k * (object$n - object$df.resid)

In principle this is the right formula (the number of observations minus the residual df should be the number of parameters fitted), but digging in, it looks like the $df.resid component of the object is not updated properly:

a$n  ## 31, correct
a$df.resid  ## 7, only valid before updating!
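
The effect of the stale $df.resid on the reported value can be checked by hand. The following is a minimal base-R sketch, not code from biglm itself: it assumes that deviance() on a biglm object returns the residual sum of squares (which is consistent with the numbers reported above), and the variable names are made up for illustration.

```r
# Base-R check of the arithmetic; biglm itself is not needed here.
ff  <- log(Volume) ~ log(Girth) + log(Height)
fit <- lm(ff, data = trees)

rss     <- deviance(fit)       # residual sum of squares of the full fit
n_total <- nrow(trees)         # 31 rows seen after all updates
p       <- length(coef(fit))   # 3 coefficients actually fitted

# Assumption: df.resid was set from the first 10-row chunk and never updated.
df_resid_stale <- 10 - p       # 7

n_params_counted <- n_total - df_resid_stale   # 24 rather than 3
aic_stale    <- rss + 2 * n_params_counted     # about 48.185, as reported by AIC(a)
aic_intended <- rss + 2 * p                    # same formula with the correct count
```

So the penalty term is 2 * 24 = 48 where 2 * 3 = 6 was intended, and the remainder of the reported 48.18546 is just the residual sum of squares.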

Looking at biglm:::update.biglm, I would add

object$df.resid <- object$df.resid + NROW(mm)

right before or after the line that reads

object$n <- object$n + NROW(mm)

...

This seems like a fairly obvious bug to me, so perhaps I'm missing something obvious, or perhaps it has already been fixed.

A simple workaround is to define your own AIC method as

AIC.biglm <- function (object, ..., k = 2) {
    deviance(object) + k * length(coef(object))
}

AIC(a)-AIC(a2)  ## penalty terms now use 3 and 2 coefficients

(Note, though, that AIC(a_lm) is still not equal to AIC(a): stats:::AIC.default() is built on -2*log-likelihood rather than on the deviance, and for a least-squares fit the deviance is the residual sum of squares, which is on a different scale.)
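
If AIC values on the same scale as lm() are wanted, one option is to rebuild the Gaussian log-likelihood from the residual sum of squares. The sketch below is not part of biglm: aic_from_rss is a hypothetical helper name, and it assumes that deviance() on the fitted object returns the residual sum of squares, as it does for lm.

```r
# Gaussian log-likelihood AIC rebuilt from summary quantities.
# rss: residual sum of squares, n: number of rows, p: number of coefficients.
aic_from_rss <- function(rss, n, p, k = 2) {
  neg2_loglik <- n * (log(2 * pi) + log(rss / n) + 1)
  neg2_loglik + k * (p + 1)   # +1 for the estimated error variance
}

# Check against lm() on the trees example.
fit <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)
matches_lm <- all.equal(
  aic_from_rss(deviance(fit), nobs(fit), length(coef(fit))),
  AIC(fit)
)

# For a biglm fit `a` as above, the analogous call would be
# (assuming deviance(a) is the residual sum of squares):
# aic_from_rss(deviance(a), a$n, length(coef(a)))
```

Because this only needs the residual sum of squares, the row count, and the coefficient count, it does not depend on $df.resid at all.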
