`lm` summary does not display all factor levels

Question

I am running a linear regression on a number of attributes, including two categorical attributes, B and F, and I don't get a coefficient value for every factor level I have.

B has 9 levels and F has 6 levels. When I initially ran the model (with an intercept), I got 8 coefficients for B and 5 for F, which I took to mean that the first level of each factor was absorbed into the intercept.

I want to rank the levels within B and F based on their coefficients, so I added -1 after each factor to lock the intercept at 0 and get coefficients for all levels.

Call:
lm(formula = dependent ~ a + B-1 + c + d + e + F-1 + g + h, data = input)

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
a     2.082e+03  1.026e+02  20.302  < 2e-16 ***
B1   -1.660e+04  9.747e+02 -17.027  < 2e-16 ***
B2   -1.681e+04  9.379e+02 -17.920  < 2e-16 ***
B3   -1.653e+04  9.254e+02 -17.858  < 2e-16 ***
B4   -1.765e+04  9.697e+02 -18.202  < 2e-16 ***
B5   -1.535e+04  1.388e+03 -11.059  < 2e-16 ***
B6   -1.677e+04  9.891e+02 -16.954  < 2e-16 ***
B7   -1.644e+04  9.694e+02 -16.961  < 2e-16 ***
B8   -1.931e+04  9.899e+02 -19.512  < 2e-16 ***
B9   -1.722e+04  9.071e+02 -18.980  < 2e-16 ***
c    -6.928e-01  6.977e-01  -0.993 0.321272    
d    -3.288e-01  2.613e+00  -0.126 0.899933    
e    -8.384e-01  1.171e+00  -0.716 0.474396    
F2    4.679e+02  2.176e+02   2.150 0.032146 *  
F3    7.753e+02  2.035e+02   3.810 0.000159 ***
F4    1.885e+02  1.689e+02   1.116 0.265046    
F5    5.194e+02  2.264e+02   2.295 0.022246 *  
F6    1.365e+03  2.334e+02   5.848 9.94e-09 ***
g     4.278e+00  7.350e+00   0.582 0.560847    
h     2.717e-02  5.100e-03   5.328 1.62e-07 ***

This worked in part: all levels of B are now displayed, but F1 is still missing. As there is no longer an intercept, I am confused about why F1 is not in the linear model.

Switching the order of the terms so that + F - 1 precedes + B - 1 makes the coefficients of all levels of F visible, but then B1 is missing.

Does anybody know either how to display all levels of both B and F, or how to assess the relative weight of F1 compared to the other levels of F from the output I have?

Recommended answer

This issue is raised over and over again, but unfortunately there is no satisfying answer that could serve as an appropriate duplicate target. Looks like I need to write one.

Most people know this is related to "contrasts", but not everyone knows why contrasts are needed or how to interpret the result. To fully digest this, we have to look at the model matrix.

Suppose we are interested in a model with two factors: ~ f + g (numerical covariates do not matter, so I include none of them; the response does not appear in the model matrix, so I drop it, too). Consider the following reproducible example:

set.seed(0)

f <- sample(gl(3, 4, labels = letters[1:3]))
# [1] c a a b b a c b c b a c
#Levels: a b c

g <- sample(gl(3, 4, labels = LETTERS[1:3]))
# [1] A B A B C B C A C C A B
#Levels: A B C

We start with a model matrix with no contrasts at all:

X0 <- model.matrix(~ f + g, contrasts.arg = list(
                   f = contr.treatment(n = 3, contrasts = FALSE),
                   g = contr.treatment(n = 3, contrasts = FALSE)))

#   (Intercept) f1 f2 f3 g1 g2 g3
#1            1  0  0  1  1  0  0
#2            1  1  0  0  0  1  0
#3            1  1  0  0  1  0  0
#4            1  0  1  0  0  1  0
#5            1  0  1  0  0  0  1
#6            1  1  0  0  0  1  0
#7            1  0  0  1  0  0  1
#8            1  0  1  0  1  0  0
#9            1  0  0  1  0  0  1
#10           1  0  1  0  0  0  1
#11           1  1  0  0  1  0  0
#12           1  0  0  1  0  1  0

Note that we have:

unname( rowSums(X0[, c("f1", "f2", "f3")]) )
# [1] 1 1 1 1 1 1 1 1 1 1 1 1

unname( rowSums(X0[, c("g1", "g2", "g3")]) ) 
# [1] 1 1 1 1 1 1 1 1 1 1 1 1

So the (Intercept) column lies in both span{f1, f2, f3} and span{g1, g2, g3}, because f1 + f2 + f3 = g1 + g2 + g3 = (Intercept). In this full specification, 2 columns are therefore redundant and 2 parameters are not identifiable. X0 has column rank 1 + 3 + 3 - 2 = 5:

qr(X0)$rank
# [1] 5

So, if we fit a linear model with this X0, 2 of the 7 coefficients will be NA:

y <- rnorm(12)  ## random `y` as a response
lm(y ~ X0 - 1)  ## drop intercept as `X0` has intercept already

#X0(Intercept)           X0f1           X0f2           X0f3           X0g1  
#      0.32118        0.05039       -0.22184             NA       -0.92868  
#         X0g2           X0g3  
#     -0.48809             NA  

What this really implies is that we have to add 2 linear constraints on the 7 parameters in order to get a full-rank model. It does not really matter what these 2 constraints are, but they must be linearly independent. For example, we can do any of the following:

  • drop any 2 columns from X0;
  • add two sum-to-zero constraints on the parameters, for example requiring the coefficients for f1, f2 and f3 to sum to 0, and the same for g1, g2 and g3;
  • use regularization, for example adding a ridge penalty to f and g.

Note that these three approaches lead to three different kinds of solution:

  • contrasts;
  • constrained least squares;
  • linear mixed models or penalized least squares.

The first two are still within the scope of fixed-effects modelling. With "contrasts", we reduce the number of parameters until the model matrix has full rank, while the other two approaches do not reduce the number of parameters but do reduce the effective degrees of freedom.
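The penalized route can be sketched in base R as follows. This is only an illustration under stated assumptions, not the answer's own code: the factor values are copied from the printed output of the example above, while the response `y` and the penalty weight `lambda` are made-up values, and leaving the intercept unpenalized is a choice made for this sketch.

```r
# Factor values copied from the example's printed output (no random sampling needed)
f <- factor(c("c", "a", "a", "b", "b", "a", "c", "b", "c", "b", "a", "c"))
g <- factor(c("A", "B", "A", "B", "C", "B", "C", "A", "C", "C", "A", "B"))

# Hypothetical response, for illustration only
y <- c(0.3, -1.1, 0.8, 1.5, -0.4, 0.2, 2.1, -0.7, 1.0, 0.6, -1.3, 0.9)

# Full dummy coding: intercept plus one column per level of f and g (rank-deficient)
X0 <- model.matrix(~ f + g, contrasts.arg = list(
  f = contr.treatment(n = 3, contrasts = FALSE),
  g = contr.treatment(n = 3, contrasts = FALSE)))

lambda <- 0.1                            # assumed penalty weight
P <- diag(c(0, rep(1, ncol(X0) - 1)))    # penalize every column except the intercept

# Ridge normal equations: (X'X + lambda * P) beta = X'y
beta_ridge <- drop(solve(crossprod(X0) + lambda * P, crossprod(X0, y)))
beta_ridge                               # all 7 coefficients are finite, none is NA
```

The penalty makes the system solvable even though X0 is rank-deficient, so every level gets a coefficient. The values depend on `lambda`, which is why this is a different solution from the contrast-based fits.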

Now, you are certainly after the "contrasts" approach. So, remember, we have to drop 2 columns. They can be

  • one column from f and one column from g, giving the model ~ f + g, with both f and g contrasted;
  • the intercept, and one column from either f or g, giving the model ~ f + g - 1.
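The two choices above can be compared directly. The sketch below reuses the factor values printed in the example; the response `y` is a made-up vector used only for illustration.

```r
# Factor values copied from the example's printed output
f <- factor(c("c", "a", "a", "b", "b", "a", "c", "b", "c", "b", "a", "c"))
g <- factor(c("A", "B", "A", "B", "C", "B", "C", "A", "C", "C", "A", "B"))

# Hypothetical response, for illustration only
y <- c(0.3, -1.1, 0.8, 1.5, -0.4, 0.2, 2.1, -0.7, 1.0, 0.6, -1.3, 0.9)

fit1 <- lm(y ~ f + g)       # drops one column from f and one from g
fit2 <- lm(y ~ f + g - 1)   # drops the intercept and one column from g

names(coef(fit1))   # "(Intercept)" "fb" "fc" "gB" "gC"
names(coef(fit2))   # "fa" "fb" "fc" "gB" "gC"

# Both model matrices span the same column space, so the fits are identical
all.equal(fitted(fit1), fitted(fit2))   # TRUE
```

Under R's default treatment contrasts, the level that has no coefficient is the baseline, with an implicit coefficient of 0, and the reported coefficients for the other levels are differences from it. For ranking, the missing level can therefore be treated as 0, for example `c(gA = 0, coef(fit2)[c("gB", "gC")])`.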

It should now be clear that, within the framework of dropping columns, there is no way to get what you want, because you are expecting to drop only 1 column. The resulting model matrix will still be rank-deficient.
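A quick check of this claim, as a sketch that rebuilds the full dummy-coded matrix from the factor values printed in the example and then removes only the intercept column:

```r
# Factor values copied from the example's printed output
f <- factor(c("c", "a", "a", "b", "b", "a", "c", "b", "c", "b", "a", "c"))
g <- factor(c("A", "B", "A", "B", "C", "B", "C", "A", "C", "C", "A", "B"))

# Full dummy coding: intercept plus one column per level
X0 <- model.matrix(~ f + g, contrasts.arg = list(
  f = contr.treatment(n = 3, contrasts = FALSE),
  g = contr.treatment(n = 3, contrasts = FALSE)))

X1 <- X0[, -1]   # drop only the intercept, keep every level of f and g

ncol(X1)         # 6 columns
qr(X1)$rank      # 5, so one coefficient would still be NA
```

The remaining dependency is f1 + f2 + f3 = g1 + g2 + g3, which is why `lm` still omits one level of whichever factor comes second in the formula.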

If you really want to have all coefficients there, use constrained least squares, or penalized regression / linear mixed models.
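One way to obtain an effect for every level under sum-to-zero constraints, without a dedicated constrained solver, is to fit with sum-to-zero contrasts (`contr.sum`) and recover the last level as minus the sum of the others. This is a swapped-in shortcut rather than the answer's own method; the response `y` is made up and the names `f_eff` and `g_eff` are introduced here for illustration.

```r
# Factor values copied from the example's printed output
f <- factor(c("c", "a", "a", "b", "b", "a", "c", "b", "c", "b", "a", "c"))
g <- factor(c("A", "B", "A", "B", "C", "B", "C", "A", "C", "C", "A", "B"))

# Hypothetical response, for illustration only
y <- c(0.3, -1.1, 0.8, 1.5, -0.4, 0.2, 2.1, -0.7, 1.0, 0.6, -1.3, 0.9)

# Sum-to-zero contrasts: coefficients are named f1, f2, g1, g2
fit <- lm(y ~ f + g, contrasts = list(f = "contr.sum", g = "contr.sum"))
b <- coef(fit)

# The effect of the last level is minus the sum of the reported ones
f_eff <- c(b[c("f1", "f2")], -sum(b[c("f1", "f2")]))
names(f_eff) <- levels(f)
g_eff <- c(b[c("g1", "g2")], -sum(b[c("g1", "g2")]))
names(g_eff) <- levels(g)

sort(f_eff)   # all three levels of f, ranked
sort(g_eff)   # all three levels of g, ranked
```

Differences between levels are the same as under the default treatment contrasts (for example, `f_eff["b"] - f_eff["a"]` matches the `fb` coefficient of `lm(y ~ f + g)`), so the ranking does not depend on which full-rank parametrisation is used.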

When the model includes interactions between factors, things are more complicated, but the idea is still the same. Given that my answer is already long enough, I will not go into that here.
