`lm` summary does not display all factor levels
Problem description
I am running a linear regression on a number of attributes, including two categorical attributes, `B` and `F`, and I don't get a coefficient value for every factor level I have.
`B` has 9 levels and `F` has 6 levels. When I initially ran the model (with an intercept), I got 8 coefficients for `B` and 5 for `F`, which I understood as the first level of each being absorbed into the intercept.
I want to rank the levels within `B` and `F` based on their coefficients, so I added `-1` after each factor to lock the intercept at 0 so that I could get coefficients for all levels.
Call:
lm(formula = dependent ~ a + B-1 + c + d + e + F-1 + g + h, data = input)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
a 2.082e+03 1.026e+02 20.302 < 2e-16 ***
B1 -1.660e+04 9.747e+02 -17.027 < 2e-16 ***
B2 -1.681e+04 9.379e+02 -17.920 < 2e-16 ***
B3 -1.653e+04 9.254e+02 -17.858 < 2e-16 ***
B4 -1.765e+04 9.697e+02 -18.202 < 2e-16 ***
B5 -1.535e+04 1.388e+03 -11.059 < 2e-16 ***
B6 -1.677e+04 9.891e+02 -16.954 < 2e-16 ***
B7 -1.644e+04 9.694e+02 -16.961 < 2e-16 ***
B8 -1.931e+04 9.899e+02 -19.512 < 2e-16 ***
B9 -1.722e+04 9.071e+02 -18.980 < 2e-16 ***
c -6.928e-01 6.977e-01 -0.993 0.321272
d -3.288e-01 2.613e+00 -0.126 0.899933
e -8.384e-01 1.171e+00 -0.716 0.474396
F2 4.679e+02 2.176e+02 2.150 0.032146 *
F3 7.753e+02 2.035e+02 3.810 0.000159 ***
F4 1.885e+02 1.689e+02 1.116 0.265046
F5 5.194e+02 2.264e+02 2.295 0.022246 *
F6 1.365e+03 2.334e+02 5.848 9.94e-09 ***
g 4.278e+00 7.350e+00 0.582 0.560847
h 2.717e-02 5.100e-03 5.328 1.62e-07 ***
This worked in part: all levels of `B` are now displayed, but `F1` is still missing. As there is no longer an intercept, I am confused about why `F1` is not in the linear model.
Switching the order of the call so that `+ F - 1` precedes `+ B - 1` results in coefficients for all levels of `F` being visible, but not `B1`.
Does anybody know either how to display all levels of both `B` and `F`, or how to assess the relative weight of `F1` compared to the other levels of `F` from the outputs I have?
Recommended answer
This issue is raised over and over again, but unfortunately there is no satisfying answer that could serve as an appropriate duplicate target. Looks like I need to write one.
Most people know this is related to "contrasts", but not everyone knows why contrasts are needed or how to interpret the result. We have to look at the model matrix in order to fully digest this.
Suppose we are interested in a model with two factors: `~ f + g` (numerical covariates do not matter, so I include none of them; the response does not appear in the model matrix, so I drop it too). Consider the following reproducible example:
set.seed(0)
f <- sample(gl(3, 4, labels = letters[1:3]))
# [1] c a a b b a c b c b a c
#Levels: a b c
g <- sample(gl(3, 4, labels = LETTERS[1:3]))
# [1] A B A B C B C A C C A B
#Levels: A B C
We start with a model matrix with no contrasts at all:
X0 <- model.matrix(~ f + g, contrasts.arg = list(
f = contr.treatment(n = 3, contrasts = FALSE),
g = contr.treatment(n = 3, contrasts = FALSE)))
# (Intercept) f1 f2 f3 g1 g2 g3
#1 1 0 0 1 1 0 0
#2 1 1 0 0 0 1 0
#3 1 1 0 0 1 0 0
#4 1 0 1 0 0 1 0
#5 1 0 1 0 0 0 1
#6 1 1 0 0 0 1 0
#7 1 0 0 1 0 0 1
#8 1 0 1 0 1 0 0
#9 1 0 0 1 0 0 1
#10 1 0 1 0 0 0 1
#11 1 1 0 0 1 0 0
#12 1 0 0 1 0 1 0
Note that we have:
unname( rowSums(X0[, c("f1", "f2", "f3")]) )
# [1] 1 1 1 1 1 1 1 1 1 1 1 1
unname( rowSums(X0[, c("g1", "g2", "g3")]) )
# [1] 1 1 1 1 1 1 1 1 1 1 1 1
So `f1 + f2 + f3 = g1 + g2 + g3 = (Intercept)`, that is, the intercept column lies in both span{f1, f2, f3} and span{g1, g2, g3}. In this full specification, 2 columns are not identifiable. `X0` has column rank `1 + 3 + 3 - 2 = 5`:
qr(X0)$rank
# [1] 5
So, if we fit a linear model with this `X0`, 2 of the 7 coefficients will be `NA`:
y <- rnorm(12) ## random `y` as a response
lm(y ~ X0 - 1) ## drop intercept as `X0` has intercept already
#X0(Intercept) X0f1 X0f2 X0f3 X0g1
# 0.32118 0.05039 -0.22184 NA -0.92868
# X0g2 X0g3
# -0.48809 NA
What this really implies is that we have to add 2 linear constraints on the 7 parameters in order to get a full-rank model. It does not really matter what these 2 constraints are, but there must be 2 linearly independent constraints. For example, we can do any of the following:
- drop 2 columns from `X0` (chosen so that the remaining 5 columns are linearly independent);
- add two sum-to-zero constraints on the parameters, for example requiring the coefficients for `f1`, `f2` and `f3` to sum to 0, and the same for `g1`, `g2` and `g3`;
- use regularization, for example adding a ridge penalty to `f` and `g`.
Note that these three approaches lead to three different kinds of solution:
- contrasts;
- constrained least squares;
- linear mixed models or penalized least squares.
The first two are still within the scope of fixed-effects modelling. With "contrasts", we reduce the number of parameters until we get a full-rank model matrix, while the other two approaches do not reduce the number of parameters but do reduce the effective degrees of freedom.
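As a rough illustration of the sum-to-zero idea, the sketch below uses `contr.sum`, which builds the constraint into the coding so that the coefficient of the last level can be recovered as minus the sum of the others. This is one convenient way to get a value for every level, not necessarily the constrained least squares fit the answer has in mind; the data (`f`, `g`, `y`) and the names `f_eff` and `g_eff` are made up for the example.

```r
## Toy data: two crossed 3-level factors and a random response (assumed, not from the question)
set.seed(1)
f <- gl(3, 4, labels = letters[1:3])
g <- factor(rep(LETTERS[1:3], times = 4))
y <- rnorm(12)

## Sum-to-zero coding: the model matrix stays full rank (5 columns)
fit <- lm(y ~ f + g, contrasts = list(f = "contr.sum", g = "contr.sum"))
b <- coef(fit)   # names: (Intercept), f1, f2, g1, g2

## Recover the coefficient of the last level from the constraint sum = 0
f_eff <- c(b[c("f1", "f2")], -sum(b[c("f1", "f2")]))
names(f_eff) <- levels(f)
g_eff <- c(b[c("g1", "g2")], -sum(b[c("g1", "g2")]))
names(g_eff) <- levels(g)

sort(f_eff)   # all three levels of f, on a common scale, ready to rank
sort(g_eff)
```

The fitted values are the same as with the default treatment contrasts; only the parametrization changes, so each level effect is now a deviation from the overall mean rather than from a baseline level.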
Now, you are certainly after the "contrasts" way. So remember, we have to drop 2 columns. They can be
- one column from `f` and one column from `g`, giving the model `~ f + g`, with both `f` and `g` contrasted;
- the intercept and one column from either `f` or `g`, giving the model `~ f + g - 1`.
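The two options can be checked directly on a small example. The sketch below uses made-up factors `f` and `g` (the names `X1`, `X2`, `X3` are only for illustration) and compares the columns R keeps in each case. Note that `- 1` removes the single model intercept wherever it appears in the formula, so writing it after each factor has no additional effect.

```r
## Toy factors (assumed for illustration): two crossed 3-level factors
f <- gl(3, 4, labels = letters[1:3])
g <- factor(rep(LETTERS[1:3], times = 4))

X1 <- model.matrix(~ f + g)       # intercept kept: first level of f and of g dropped
X2 <- model.matrix(~ f + g - 1)   # intercept dropped: f fully coded, g still contrasted
X3 <- model.matrix(~ g + f - 1)   # order swapped: g fully coded, f contrasted

colnames(X1)   # "(Intercept)" "fb" "fc" "gB" "gC"
colnames(X2)   # "fa" "fb" "fc" "gB" "gC"
colnames(X3)   # "gA" "gB" "gC" "fb" "fc"

## All three have 5 columns and rank 5: the same model, parametrized differently
c(qr(X1)$rank, qr(X2)$rank, qr(X3)$rank)
```

This mirrors what the question reports: whichever factor comes first in a no-intercept formula gets all of its levels, and every later factor still loses its first level.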
Now it should be clear that, within the framework of dropping columns, there is no way to get what you want, because you are expecting to drop only 1 column. The resulting model matrix would still be rank-deficient.
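Even so, the levels can still be ranked from a contrasted fit. Under R's default treatment contrasts the dropped level is the baseline, so its coefficient is implicitly 0 and every displayed coefficient is a difference from that baseline. A minimal sketch on made-up data (`f`, `g`, `y` and `f_rel` are assumptions for the example, not the question's variables):

```r
## Toy data (assumed for illustration)
set.seed(1)
f <- gl(3, 4, labels = letters[1:3])
g <- factor(rep(LETTERS[1:3], times = 4))
y <- rnorm(12)

fit <- lm(y ~ f + g)   # default treatment contrasts, baseline levels "a" and "A"

## Put the baseline back with coefficient 0, then rank all levels of f
f_rel <- c(0, coef(fit)[c("fb", "fc")])
names(f_rel) <- levels(f)
sort(f_rel)
```

Applied to the question's output, this means `F1` sits at 0 relative to `F2` to `F6`, whose coefficients are their differences from `F1`.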
If you really want to have all coefficients there, use constrained least squares, or penalized regression / linear mixed models.
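For the penalized route, here is a minimal ridge sketch written out by hand in base R, so that no particular package API is assumed. It keeps all 7 columns of the full dummy coding and penalizes only the factor coefficients. The data, the penalty matrix `D` and the value `lambda = 0.1` are arbitrary choices for illustration.

```r
## Toy data (assumed for illustration)
set.seed(1)
f <- gl(3, 4, labels = letters[1:3])
g <- factor(rep(LETTERS[1:3], times = 4))
y <- rnorm(12)

## Full dummy coding, as in the answer: intercept + 3 columns for f + 3 for g
X0 <- model.matrix(~ f + g, contrasts.arg = list(
  f = contr.treatment(n = 3, contrasts = FALSE),
  g = contr.treatment(n = 3, contrasts = FALSE)))

## Ridge penalty on the 6 factor coefficients only; the intercept is left unpenalized
D <- diag(c(0, rep(1, 6)))
lambda <- 0.1   # arbitrary value, chosen only for the example

## Penalized normal equations: (X'X + lambda * D) beta = X'y
beta <- drop(solve(crossprod(X0) + lambda * D, crossprod(X0, y)))
beta   # one coefficient per level of f and g, none NA
```

With this particular penalty, the estimated coefficients within each factor sum to zero, so the result is comparable to a shrunken version of the sum-to-zero solution. The estimates depend on `lambda`, which would need to be chosen by some criterion such as cross-validation.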
Now, when we have interactions between factors, things are more complicated, but the idea is still the same. Given that my answer is already long enough, I won't go into that here.