How to update `lm` or `glm` model on same subset of data?


Problem description

I am trying to fit two nested models and then test them against each other using the anova function. The commands used are:

probit <- glm(grad ~ afqt1 + fhgc + mhgc + hisp + black + male, data=dt, 
    family=binomial(link = "probit"))
nprobit <- update(probit, . ~ . - afqt1)
anova(nprobit, probit, test="Rao")

However, the variable afqt1 apparently contains NAs, and because the update call does not use the same subset of data, anova() returns an error:

Error in anova.glmlist(c(list(object), dotargs), dispersion = dispersion, : models were not all fitted to the same size of dataset

Is there a simple way to refit the model on the same dataset as the original model?

Recommended answer

As suggested in the comments, a straightforward approach is to reuse the model data stored in the first fit (e.g. probit), relying on update's ability to overwrite arguments from the original call.

Here's a reproducible example:

data(mtcars)
mtcars[1,2] <- NA
nobs( xa <- lm(mpg~cyl+disp, mtcars) ) 
## [1] 31
nobs( update(xa, .~.-cyl) )  ##not nested
## [1] 32
nobs( xb <- update(xa, .~.-cyl, data=xa$model) )  ##nested
## [1] 31

It is easy enough to define a convenience wrapper around this:

update_nested <- function(object, formula., ..., evaluate = TRUE){
    update(object = object, formula. = formula., data = object$model, ..., evaluate = evaluate)
}

This forces the data argument of the updated call to re-use the data from the first model fit.

nobs( xc <- update_nested(xa, .~.-cyl) )
## [1] 31
all.equal(xb, xc)  ##only the `call` component will be different
## [1] "Component "call": target, current do not match when deparsed"
identical(xb[-10], xc[-10])
## [1] TRUE

So now you can easily do anova:

anova(xa, xc)
## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp
## Model 2: mpg ~ disp
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     28 269.97                              
## 2     29 312.96 -1   -42.988 4.4584 0.04378 *
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
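
The same idea carries over to the probit case from the question. The following is only a minimal sketch, not the original poster's data: it assumes a copy of the built-in mtcars (called cars_na here, a hypothetical name) with one wt value set to NA, and uses am, hp and wt as stand-ins for grad and its predictors.

```r
# Stand-in data: the question's `dt` is not available, so use a copy of mtcars
cars_na <- datasets::mtcars
cars_na[1, "wt"] <- NA   # one missing value in a predictor

probit <- glm(am ~ hp + wt, data = cars_na,
              family = binomial(link = "probit"))

# Naive update: dropping wt brings the row with the NA back into the fit
naive <- update(probit, . ~ . - wt)

# Refit on exactly the rows the full model used
nprobit <- update(probit, . ~ . - wt, data = model.frame(probit))

n_used <- c(full = nobs(probit), naive = nobs(naive), nested = nobs(nprobit))
print(n_used)

# Both models now share the same observations, so the score test can run
rao_table <- anova(nprobit, probit, test = "Rao")
print(rao_table)
```

Here model.frame(probit) plays the same role as xa$model above: it returns the data frame the first fit actually used, with the incomplete rows already dropped.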


The other approach suggested is to call na.omit on the data frame before the lm() call. At first I thought this would be impractical with a big data frame (e.g. 1000 columns) and a large number of variables across the various specifications (e.g. ~15 variables). The concern is not speed: this approach requires manual bookkeeping of which variables should be cleaned of NAs and which should not, which is precisely what the OP seems intent on avoiding. The biggest drawback is that you must always keep the formula in sync with the subsetted data frame.

As it turns out, however, this can be overcome rather easily:

data(mtcars)
for(i in 1:ncol(mtcars)) mtcars[i,i] <- NA
nobs( xa <- lm(mpg~cyl + disp + hp + drat + wt + qsec + vs + am + gear + 
                    carb, mtcars) ) 
## [1] 21
nobs( xb <- update(xa, .~.-cyl) )  ##not nested
## [1] 22
nobs( xb <- update_nested(xa, .~.-cyl) )  ##nested
## [1] 21
nobs( xc <- update(xa, .~.-cyl, data=na.omit(mtcars[ , all.vars(formula(xa))])) )  ##nested
## [1] 21
all.equal(xb, xc)
## [1] "Component "call": target, current do not match when deparsed"
identical(xb[-10], xc[-10])
## [1] TRUE

anova(xa, xc)
## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     10 104.08                           
## 2     11 104.42 -1  -0.34511 0.0332 0.8591
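
The two approaches can differ when the formula contains a transformed term such as log(disp). As far as I can tell, the stored model frame then has a column literally named log(disp) and no disp column, so refitting with data = fit$model cannot find the raw variable, whereas selecting columns via all.vars() still works. The sketch below illustrates this on a copy of mtcars; cars_na, fit_full and the other names are hypothetical.

```r
cars_na <- datasets::mtcars
cars_na[1, "cyl"] <- NA

fit_full <- lm(mpg ~ log(disp) + cyl, data = cars_na)

# Reusing the model frame: expected to fail here, because `disp` itself
# is not a column of fit_full$model (only `log(disp)` is)
from_model_frame <- tryCatch(
  update(fit_full, . ~ . - cyl, data = fit_full$model),
  error = function(e) e
)
print(inherits(from_model_frame, "error"))

# Complete cases over the raw variables named in the full formula
vars_used <- all.vars(formula(fit_full))
fit_reduced <- update(fit_full, . ~ . - cyl,
                      data = na.omit(cars_na[, vars_used]))

print(c(full = nobs(fit_full), reduced = nobs(fit_reduced)))
anova(fit_reduced, fit_full)
```

For formulas made up of plain variable names only, as in the examples above, both routes should give the same fit.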
