在模型列表上使用stepAIC [英] use stepAIC on a list of models

查看:254
本文介绍了在模型列表上使用stepAIC的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想对一系列线性模型使用AIC进行逐步回归.想法是使用线性模型列表,然后在每个列表元素上应用stepAIC.它失败了.

大家好,我试图找出问题所在. 我想我找到了问题.但是,我不明白原因. 尝试代码以查看三种情况之间的区别.

require(MASS)
n<-30 
x1<-rnorm(n, mean=0, sd=1) #create rv x1 
x2<-rnorm(n, mean=1, sd=1)
x3<-rnorm(n, mean=2, sd=1)
epsilon<-rnorm(n,mean=0,sd=1) # random error variable 
dat<-as.data.frame(cbind(x1,x2,x3,epsilon)) # combine to a data frame
dat$id<-c(rep(1,10),rep(2,10),rep(3,10)) 
# y is combination from all three x and a random uniform variable
dat$y<-x1+x2+x3+epsilon 
# apply lm() only resulting in a list of models
dat.lin.model.lst<-lapply(split(dat,dat$id),function(d) lm(y~x1+x2+x3,data=d)) 
stepAIC(dat.lin.model.lst[[1]]) # FAIL!!!
# apply function stepAIC(lm())-  works
dat.lin.model.stepAIC.lst<-lapply(split(dat,dat$id),function(d) stepAIC(lm(y~x1+x2+x3,data=d))) 
# create model for particular group with id==1
k<-which(dat$id==1) # manually select records with id==1
lin.model.id1<-lm(dat$y[k]~dat$x1[k]+dat$x2[k]+dat$x3[k]) 
stepAIC(lin.model.id1) # check stepAIC - works!

我很确定stepAIC()需要data.frame"dat"中的原始数据.那就是我以前想的. (希望我是对的) 但是在stepAIC()中没有参数可以传递原始数据帧.显然,对于未包装在列表中的普通模型,足以传递模型. (代码中的最后三行)所以我想知道:
Q1:stepAIC如何知道在哪里可以找到原始数据"dat"(不仅是作为参数传递的模型数据)?
问题2:如何知道stepAIC()中有另一个未在帮助页面中明确指出的参数? (也许我的英语太难找了)
Q3:如何将该参数传递给stepAIC()?

它必须位于apply函数环境中并传递数据的某个地方. lm()或stepAIC()以及指向原始数据的指针/链接都必须在某处丢失.我不太了解R中的环境是做什么的.对我来说,这是一种将局部变量与全局变量隔离的方法.但也许它更复杂.关于上述问题,有人可以向我解释吗?老实说,我没有从 R文档中读到很多东西.任何更好的理解都会对我有所帮助.谢谢.

OLD: 我在数据帧df中有数据,该数据帧可以分为几个子组.为此,我创建了一个名为df $ id的groupID. lm()返回第一个子组的期望系数.我想使用AIC作为每个子组的标准分别进行逐步回归.我使用lmList {lme4},它为每个子组(id)生成一个模型.但是,如果我将stepAIC {MASS}用于列表元素,则会引发错误.见下文.

所以问题是:我的程序/语法有什么错误?我得到的是单个模型的结果,而不是使用lmList创建的模型. lmList()是否在模型上存储的信息与lm()不同?
但在帮助中指出: class"lmList":具有通用模型的lm类的对象列表.

>lme4.list.lm<-lmList(formula=Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px |df$id,data = df)
>lme4.list.lm[[1]]
Call: lm(formula = formula, data = data)
Coefficients:
(Intercept)          Gap.um     Standoff.um  Voidflaeche.px  
  62.306133       -0.009878        0.026317       -0.015048  

>stepAIC(lme4.list.lm[[1]], direction="backward") 
#stepAIC on first element on the list of linear models
Start:  AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Standoff.um     1      2.81 7187.3 293.14
- Gap.um          1     29.55 7214.0 293.37
<none>                        7184.4 295.12
- Voidflaeche.px  1    604.38 7788.8 297.97  

Error in terms.formula(formula, data = data) : 
'data' argument is of the wrong type

显然,该列表不起作用.但是我不知道这可能是什么. 由于我尝试对创建相同模型(至少相同系数)的基本软件包执行相同操作.结果如下:

>lin.model<-lm(Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px,df[which(df$id==1),]) 
# id is in order, so should be the same subgroup as for the first list element in lmList

Coefficients:  
(Intercept)    Gap.um  Standoff.um  Voidflaeche.px  
  62.306133 -0.009878     0.026317       -0.015048  

好吧,这就是我在linear.model上使用stepAIC返回的内容. 据我所知,akaike信息准则可用于估计在给定数据的情况下哪种模型可以更好地在拟合和泛化之间取得平衡.

>stepAIC(lin.model,direction="backward")
Start:  AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Standoff.um     1      2.81 7187.3 293.14  
- Gap.um          1     29.55 7214.0 293.37
<none>                        7184.4 295.12
- Voidflaeche.px  1    604.38 7788.8 297.97  

Step:  AIC=293.14
Scherkraft.N ~ Gap.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Gap.um          1     28.51 7215.8 291.38
 <none>                        7187.3 293.14
- Voidflaeche.px  1    717.63 7904.9 296.85

Step:  AIC=291.38
Scherkraft.N ~ Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
<none>                        7215.8 291.38
- Voidflaeche.px  1    795.46 8011.2 295.65
Call: lm(formula = Scherkraft.N ~ Voidflaeche.px, data = df[which(df$id == 1), ])

Coefficients:
(Intercept)  Voidflaeche.px  
   71.7183         -0.0151  

我从输出中读取了我应该使用的模型: Scherkraft.N〜Voidflaeche.px ,因为这是最小的AIC.好吧,如果有人可以简短地描述输出,那就太好了.我对逐步回归(假设向后消除)的理解是,所有回归变量都包含在初始模型中.然后,最不重要的一个被淘汰.决定的标准是AIC.依此类推...我不知如何在正确解释表格方面遇到了问题.如果有人可以证实我的解释,那就太好了. -"(减号)代表消除的回归变量.最上方是开始"模型,并在下表中的表格中针对可能的消除方法计算了RSS和AIC.因此,第一张表中的第一行显示模型 Scherkraft.N〜Gap.um + Standoff.um + Voidflaeche.px -Standoff.um 将得出AIC 293.14 .选择一个没有Standoff.um的: Scherkraft.N〜Gap.um + Voidflaeche.px


我将lmList {lme4}替换为dlply()以创建模型列表. stepAIC仍然无法处理该列表.它将引发另一个错误.实际上,我认为stepAIC需要运行的数据存在问题.我想知道它是如何仅从模型数据计算每个步骤的AIC值的. 将采用原始数据来构建模型,每次都保留一个回归变量.我将据此计算AIC并进行比较.因此,如果stepAIC无法访问原始数据,它将如何工作. (我看不到将原始数据传递给stepAIC的参数).不过,我不知道为什么它适用于普通模型,但不适用于列表中包装的模型.

>model.list.all <- dlply(df, .id, function(x) 
  {return(lm(Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px,data=x)) })
>stepAIC(model.list.all[[1]])
Start:  AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Standoff.um     1      2.81 7187.3 293.14
- Gap.um          1     29.55 7214.0 293.37
<none>                        7184.4 295.12
- Voidflaeche.px  1    604.38 7788.8 297.97
Error in is.data.frame(data) : object 'x' not found

解决方案

我不确定在版本控制中可能进行了哪些更改以使调试如此困难,但是一种解决方案是使用do.call来评估在执行之前调用中的表达式.这意味着,除了在调用中仅存储d之外,以使updatestepAIC需要去查找d才能完成工作,它还存储了数据帧本身的完整表示.

也就是说,

do.call("lm", list(y~x1+x2+x3, data=d))

代替

lm(y~x1+x2+x3, data=d)

通过查看模型的call元素,您可以看到它正在尝试做什么:

dat.lin.model.lst <- lapply(split(dat, dat$id), function(d)
                            do.call("lm", list(y~x1+x2+x3, data=d)) )
dat.lin.model.lst[[1]]$call

还可以在全局环境中列出数据框架,然后构造调用,以便updatestepAIC依次查找每个数据框架,因为它们的环境链总是引回到全局环境;像这样:

dats <- split(dat, dat$id)
dat.lin.model.list <- lapply(seq_along(dats), function(d)
            do.call("lm", list(y~x1+x2+x3, data=call("[[", quote(dats),i))) )

要查看更改,请再次运行dat.lin.model.lst[[1]]$call.

I want to do stepwise regression using AIC on a list of linear models. idea is to use e a list of linear models and then apply stepAIC on each list element. It fails.

Hi guys I tried to track the problem down. I think I found the problem. However, I dont understand the cause. Try the code to see the difference between three cases.

require(MASS)
n<-30 
x1<-rnorm(n, mean=0, sd=1) #create rv x1 
x2<-rnorm(n, mean=1, sd=1)
x3<-rnorm(n, mean=2, sd=1)
epsilon<-rnorm(n,mean=0,sd=1) # random error variable 
dat<-as.data.frame(cbind(x1,x2,x3,epsilon)) # combine to a data frame
dat$id<-c(rep(1,10),rep(2,10),rep(3,10)) 
# y is combination from all three x and a random uniform variable
dat$y<-x1+x2+x3+epsilon 
# apply lm() only resulting in a list of models
dat.lin.model.lst<-lapply(split(dat,dat$id),function(d) lm(y~x1+x2+x3,data=d)) 
stepAIC(dat.lin.model.lst[[1]]) # FAIL!!!
# apply function stepAIC(lm())-  works
dat.lin.model.stepAIC.lst<-lapply(split(dat,dat$id),function(d) stepAIC(lm(y~x1+x2+x3,data=d))) 
# create model for particular group with id==1
k<-which(dat$id==1) # manually select records with id==1
lin.model.id1<-lm(dat$y[k]~dat$x1[k]+dat$x2[k]+dat$x3[k]) 
stepAIC(lin.model.id1) # check stepAIC - works!

I am pretty sure that stepAIC() needs the original data from data.frame "dat". That is what I was thinking of before. (Hope I am right on that) But there is no parameter in stepAIC() where I can pass the original data frame. Obviously, for plain models not wrapped in a list it's enough to pass the model. (last three lines in code) So I am wondering:
Q1: How does stepAIC knows where to find the original data "dat" (not only the model data which is passed as parameter)?
Q2: How can I possibly know that there is another parameter in stepAIC() which is not explicitly stated in the help pages? (maybe my English is just too bad to find)
Q3: How can I pass that parameter to stepAIC()?

It must be somewhere in the environment of the apply function and passing on the data. Either lm() or stepAIC() and the pointer/link to the raw data must get lost somewhere. I have not a good understanding what an environment in R does. For me it was kind of isolating local from global variables. But maybe its more complicated. Anyone who can explain that to me in regard to the problem above? Honestly, I dont read much out of the R documentation. Any better understanding would help me. Thank you.

OLD: I have data in a dataframe df that can be split into several subgroups. For that purpose I created a groupID called df$id. lm() returns the coefficent as expected for the first subgroup. I want to do a stepwise regression using AIC as criterion for each subgroup separately. I use lmList {lme4} which results in a model for each subgroup (id). But if I use stepAIC{MASS} for the list elements it throws an error. see below.

So the question is: What mistake is in my procedure/syntax? I get results for single models but not the ones created with lmList. Does lmList() store different information on the model than lm() does?
But in the help it states: class "lmList": A list of objects of class lm with a common model.

>lme4.list.lm<-lmList(formula=Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px |df$id,data = df)
>lme4.list.lm[[1]]
Call: lm(formula = formula, data = data)
Coefficients:
(Intercept)          Gap.um     Standoff.um  Voidflaeche.px  
  62.306133       -0.009878        0.026317       -0.015048  

>stepAIC(lme4.list.lm[[1]], direction="backward") 
#stepAIC on first element on the list of linear models
Start:  AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Standoff.um     1      2.81 7187.3 293.14
- Gap.um          1     29.55 7214.0 293.37
<none>                        7184.4 295.12
- Voidflaeche.px  1    604.38 7788.8 297.97  

Error in terms.formula(formula, data = data) : 
'data' argument is of the wrong type

Obviously something does not work with the list. But I have not an idea what it might be. Since I tried to do the same with the base package which creates the same model (at least the same coefficients). Results are below:

>lin.model<-lm(Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px,df[which(df$id==1),]) 
# id is in order, so should be the same subgroup as for the first list element in lmList

Coefficients:  
(Intercept)    Gap.um  Standoff.um  Voidflaeche.px  
  62.306133 -0.009878     0.026317       -0.015048  

Well, this is what I get returned using stepAIC on my linear.model . As far as I know the akaike information criterion can be used to estimate which model better balances between fit and generalization given some data.

>stepAIC(lin.model,direction="backward")
Start:  AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Standoff.um     1      2.81 7187.3 293.14  
- Gap.um          1     29.55 7214.0 293.37
<none>                        7184.4 295.12
- Voidflaeche.px  1    604.38 7788.8 297.97  

Step:  AIC=293.14
Scherkraft.N ~ Gap.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Gap.um          1     28.51 7215.8 291.38
 <none>                        7187.3 293.14
- Voidflaeche.px  1    717.63 7904.9 296.85

Step:  AIC=291.38
Scherkraft.N ~ Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
<none>                        7215.8 291.38
- Voidflaeche.px  1    795.46 8011.2 295.65
Call: lm(formula = Scherkraft.N ~ Voidflaeche.px, data = df[which(df$id == 1), ])

Coefficients:
(Intercept)  Voidflaeche.px  
   71.7183         -0.0151  

I read from the output I should use the model: Scherkraft.N ~ Voidflaeche.px because this is the minimal AIC. Well, it would be nice if someone could shortly describe the output. My understanding of the stepwise regression (assuming backwards elimination) is all regressors are included in the initial model. Then the least important one is eliminated. The criterion to decide is the AIC. and so forth... Somehow I have problems to get the tables interpreted right. It would be nice if someone could confirm my interpretation. The "-"(minus) stands for the eliminated regressor. On top is the "start" model and in the table table below the RSS and AIC are calculated for possible eliminations. So the first row in the first table says a model Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px - Standoff.um would result in an AIC 293.14. Choose the one without Standoff.um: Scherkraft.N~Gap.um+Voidflaeche.px

EDIT:
I replaced the lmList{lme4} with dlply() to create the list of models. Still stepAIC is not coping with the list. It throws another error. Actually, I believe it is a problem with the data stepAIC needs to run through. I was wondering how it calculates the AIC-value for each step just from the model data. I would take the original data to construct the models leaving one regressor out each time. Thereof I would calculate the AIC and compare. So how stepAIC is working if it has not access to the original data. (I cant see a parameter where I pass the original data to stepAIC). Still, I have no clue why it works with a plain model but not with the model wrapped in a list.

>model.list.all <- dlply(df, .id, function(x) 
  {return(lm(Scherkraft.N~Gap.um+Standoff.um+Voidflaeche.px,data=x)) })
>stepAIC(model.list.all[[1]])
Start:  AIC=295.12
Scherkraft.N ~ Gap.um + Standoff.um + Voidflaeche.px
                 Df Sum of Sq    RSS    AIC
- Standoff.um     1      2.81 7187.3 293.14
- Gap.um          1     29.55 7214.0 293.37
<none>                        7184.4 295.12
- Voidflaeche.px  1    604.38 7788.8 297.97
Error in is.data.frame(data) : object 'x' not found

解决方案

I'm not sure what may have changed in the versioning to make the debugging so difficult, but one solution would be to use do.call, which evaluates the expressions in the call before executing it. This means that instead of storing just d in the call, so that update and stepAIC need to go find d in order to do their work, it stores a full representation of the data frame itself.

That is, do

do.call("lm", list(y~x1+x2+x3, data=d))

instead of

lm(y~x1+x2+x3, data=d)

You can see what it's trying to do by looking at the call element of the model, perhaps like this:

dat.lin.model.lst <- lapply(split(dat, dat$id), function(d)
                            do.call("lm", list(y~x1+x2+x3, data=d)) )
dat.lin.model.lst[[1]]$call

It's also possible to make your list of data frames in the global environment and then construct the call so that update and stepAIC look for each data frame in turn, because their environment chains always lead back to the global environment; like this:

dats <- split(dat, dat$id)
dat.lin.model.list <- lapply(seq_along(dats), function(d)
            do.call("lm", list(y~x1+x2+x3, data=call("[[", quote(dats),i))) )

To see what's changed, run dat.lin.model.lst[[1]]$call again.

这篇关于在模型列表上使用stepAIC的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆