Regression on subsets for unique factor combinations using lm

Problem description

I would like to automate a simple multiple regression for the subsets defined by the unique combinations of the grouping variables. I have a data frame with several grouping variables df1[,1:6], some independent variables df1[,8:10], and a response df1[,7].

Here is an excerpt of the data.

df1 <- structure(list(
  Surface = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("NiAu", "Sn"), class = "factor"),
  Supplier = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
  ParticleSize = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("3", "5"), class = "factor"),
  T1 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("130", "144"), class = "factor"),
  T2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "200", class = "factor"),
  O2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "1300", class = "factor"),
  Shear = c(56.83, 67.73, 78.51, 62.61, 66.78, 60.89, 62.94, 76.34, 70.56, 70.4, 54.15),
  Gap = c(373, 450, 417, 450, 406, 439, 439, 417, 439, 441, 417),
  Clearance = c(500.13, 509.85, 495.97, 499.55, 502.66, 505.33, 500.32, 503.28, 507.44, 500.5, 498.39),
  Void = c(316, 343, 89, 247, 271, 326, 304, 282, 437, 243, 116)),
  .Names = c("Surface", "Supplier", "ParticleSize", "T1", "T2", "O2", "Shear", "Gap", "Clearance", "Void"),
  class = "data.frame", row.names = c(NA, -11L))

Calling unique(df1[,1:6]) returns 5 factor combinations of the grouping variables, so there should be 5 subsets to which I apply the lm() function. My call looks like this:

df1.fit.by<-with(df1,by(df1,df1[,1:6], function(x) lm(Shear~Gap+Clearance+Void,data=x)))
sapply(df1.fit.by,coef)

Problem 1: It returns a list with 16 entries. Apparently it computes all possible factor combinations of the six grouping variables. (In the excerpt, V5 and V6 have only one level each, while V1 to V4 have two levels each, which gives 2^4 = 16.) It should only use the factor combinations that actually exist in the data. So I suppose by() is not the right function for this. Any suggestions?
Problem 2: I find it easier to refer to column indices than to variable names, so I initially tried to call lm() as lm(df1[,7]~df1[,8]+df1[,9]). That did not work, because those expressions always access the entire df1 data frame instead of the subset. So I should probably pass the row indices of each factor combination to lm() rather than a complete data frame.

I think the solutions to problems 1 and 2 are related and can be handled with a different subsetting function. It would be nice if someone could explain where my mistake is. If possible, I would like to stick to the standard packages, simply because I want to improve my understanding of R. Thanks.

Edit: corrected a minor mistake in the variable assignment.

Recommended answer

You can use the plyr package:

library(plyr)
# Fit one regression per observed combination of the grouping variables
list_reg <- dlply(df1, .(Surface, Supplier, ParticleSize, T1, T2), function(df) {
  lm(Shear ~ Gap + Clearance + Void, data = df)
})
# We do indeed get five different results
length(list_reg)
# Inspect one particular regression, in this case the first
summary(list_reg[[1]])

The function dlply takes a data.frame (that is what the first letter, d, stands for), in your case df1, and returns a list (that is what the second letter, l, stands for), in your case a list of five elements, each containing the result of one regression.

Internally, your df1 is split into five sub-data.frames according to the columns specified by .(Surface, Supplier, ParticleSize, T1, T2), and the function lm(Shear~Gap+Clearance+Void,data=df) is applied to each of these sub-data.frames.
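Since the question asks for a solution with the standard packages, the same split-and-apply steps can be sketched in base R. This is a minimal sketch and not part of the original answer: it rebuilds the excerpt from the question so that it runs on its own, and the names n_cells, n_observed, sub_dfs, fits and coefs are illustrative. It relies on split() with drop = TRUE, which discards level combinations that do not occur in the data, whereas by() keeps one cell per possible combination.

```r
# Rebuild the excerpt from the question so the sketch is self-contained.
df1 <- data.frame(
  Surface      = factor(rep(c("NiAu", "Sn"), times = c(5, 6))),
  Supplier     = factor(c("A", "A", "A", "B", "B", "A", "A", "B", "B", "B", "B")),
  ParticleSize = factor(rep(c("3", "5"), times = c(3, 8))),
  T1           = factor(rep(c("144", "130"), times = c(9, 2))),
  T2           = factor(rep("200", 11)),
  O2           = factor(rep("1300", 11)),
  Shear        = c(56.83, 67.73, 78.51, 62.61, 66.78, 60.89, 62.94, 76.34, 70.56, 70.4, 54.15),
  Gap          = c(373, 450, 417, 450, 406, 439, 439, 417, 439, 441, 417),
  Clearance    = c(500.13, 509.85, 495.97, 499.55, 502.66, 505.33, 500.32, 503.28, 507.44, 500.5, 498.39),
  Void         = c(316, 343, 89, 247, 271, 326, 304, 282, 437, 243, 116)
)

# by() creates one cell per combination of factor levels, observed or not.
n_cells    <- prod(sapply(df1[, 1:6], nlevels))  # number of possible combinations
n_observed <- nrow(unique(df1[, 1:6]))           # combinations present in the data

# split() with drop = TRUE keeps only the combinations that actually occur.
sub_dfs <- split(df1, df1[, 1:6], drop = TRUE)

# Apply the same model to every sub-data.frame.
fits <- lapply(sub_dfs, function(d) lm(Shear ~ Gap + Clearance + Void, data = d))

# One column of coefficients per group; coefficients that cannot be
# estimated from a group's few rows are reported as NA.
coefs <- sapply(fits, coef)
```

With only two or three rows per group in this excerpt, most coefficients come out as NA; the full data set would need more observations per group than model terms for the fits to be meaningful.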

To get a better feeling for what dlply really does, just call

list_sub_df <- dlply(df1, .(Surface, Supplier, ParticleSize, T1, T2))

and you can look at each sub-data.frame to which lm will be applied.

One general note to finish: the paper by the package author, Hadley Wickham, is really great. Even if you do not end up using his package, it is well worth reading to get a feeling for the split-apply-combine approach.

I just did a quick search and, as expected, this has already been explained better elsewhere, so make sure to also read this SO post.

If you want to use the column numbers directly, try this (taken from this SO post):

list_reg <- dlply(df1, names(df1[, 1:5]), function(df) {
  lm(Shear ~ Gap + Clearance + Void, data = df)
})
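The model formula can be built from column numbers as well, which addresses Problem 2 from the question. A call such as lm(df1[,7]~df1[,8]+df1[,9]) fails because those expressions are evaluated against the whole of df1, whatever subset is passed in. The sketch below is one possible base R approach, not taken from the original answer: fit_by_index is a hypothetical helper name, and it uses reformulate() from the stats package to turn column names into a formula that lm() then resolves inside each subset.

```r
# Rebuild the excerpt from the question so the sketch is self-contained.
df1 <- data.frame(
  Surface      = factor(rep(c("NiAu", "Sn"), times = c(5, 6))),
  Supplier     = factor(c("A", "A", "A", "B", "B", "A", "A", "B", "B", "B", "B")),
  ParticleSize = factor(rep(c("3", "5"), times = c(3, 8))),
  T1           = factor(rep(c("144", "130"), times = c(9, 2))),
  T2           = factor(rep("200", 11)),
  O2           = factor(rep("1300", 11)),
  Shear        = c(56.83, 67.73, 78.51, 62.61, 66.78, 60.89, 62.94, 76.34, 70.56, 70.4, 54.15),
  Gap          = c(373, 450, 417, 450, 406, 439, 439, 417, 439, 441, 417),
  Clearance    = c(500.13, 509.85, 495.97, 499.55, 502.66, 505.33, 500.32, 503.28, 507.44, 500.5, 498.39),
  Void         = c(316, 343, 89, 247, 271, 326, 304, 282, 437, 243, 116)
)

# Hypothetical helper: every argument after `data` is a vector of column numbers.
fit_by_index <- function(data, group_idx, response_idx, predictor_idx) {
  # Build the formula from column names, e.g. Shear ~ Gap + Clearance + Void.
  model <- reformulate(names(data)[predictor_idx],
                       response = names(data)[response_idx])
  # Keep only the factor combinations that occur in the data.
  groups <- split(data, data[, group_idx], drop = TRUE)
  # Variable names in the formula are looked up in each subset, not in `data`.
  lapply(groups, function(d) lm(model, data = d))
}

fits_idx <- fit_by_index(df1, group_idx = 1:6, response_idx = 7, predictor_idx = 8:10)
```

Because the formula refers to variables by name, each lm() call sees only the rows of its own subset, while the caller still works purely with column indices.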
