How to use formula in R to exclude main effect but retain interaction


Problem Description


I do not want the main effect because it is collinear with a finer factor fixed effect, so it is annoying to have these NAs.

In this example:

lm(y ~ x * z)

I want the interaction of x (numeric) and z (factor), but not the main effect of z.

Solution

Introduction

The R documentation of ?formula says:

The ‘*’ operator denotes factor crossing: ‘a * b’ interpreted as ‘a + b + a:b’

So it sounds like dropping a main effect is straightforward: just do one of the following:

a + a:b  ## main effect on `b` is dropped
b + a:b  ## main effect on `a` is dropped
a:b      ## main effects on both `a` and `b` are dropped

Oh, really? No no no (too simple, too naive). In reality, it depends on the variable classes of a and b.

  • If neither of them is a factor, or only one of them is, this is true;
  • If both of them are factors, it is not: you can never drop a main effect (lower-order effect) when the interaction (higher-order effect) is present. See the quick check after this list.
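
Here is a minimal sketch of the second bullet, with two made-up two-level factors a and b: the "interaction-only" formula still spans both main effects; it is merely a rank-deficient reparametrization of the full factorial model.

a <- gl(2, 4)                      ## 1 1 1 1 2 2 2 2
b <- gl(2, 2, 8)                   ## 1 1 2 2 1 1 2 2
X_full <- model.matrix(~ a * b)    ## intercept, a2, b2, a2:b2
X_int  <- model.matrix(~ a:b)      ## intercept plus one dummy column per (a, b) cell
qr(X_full)$rank                    ## 4
qr(X_int)$rank                     ## 4 as well, despite the extra column
qr(cbind(X_full, X_int))$rank      ## still 4: the two column spaces coincide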

This kind of behavior is due to a magic function called model.matrix.default, which constructs the design matrix from a formula. A numeric variable is included as-is in a column, but a factor variable is automatically coded as a set of dummy columns. It is exactly this dummy encoding that is the magic. It is commonly believed that we can enable or disable contrasts to control it, but not really; we lose control of contrasts even in the simplest examples. The problem is that model.matrix.default has its own rules for dummy encoding, and it is very sensitive to how the model formula is specified. It is exactly for this reason that we cannot drop the main effect when an interaction between two factors exists.
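
For instance, a tiny sketch with a made-up three-level factor f shows this automatic dummy coding at work:

f <- gl(3, 1, labels = c("a", "b", "c"))
model.matrix(~ f)   ## intercept plus contrast columns "fb" and "fc"; level "a" is absorbed into the intercept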


Interaction between a numeric and a factor

From your question, x is numeric and z is a factor. You can specify a model with the interaction but without the main effect of z by

y ~ x + x:z

Since x is numeric, this is equivalent to

y ~ x:z

The only difference here is parametrization (or how model.matrix.default does dummy encoding). Consider a small example:

set.seed(0)
y <- rnorm(10)
x <- rnorm(10)
z <- gl(2, 5, labels = letters[1:2])

fit1 <- lm(y ~ x + x:z)
#Coefficients:
#(Intercept)            x         x:zb  
#     0.1989      -0.1627      -0.5456  

fit2 <- lm(y ~ x:z)
#Coefficients:
#(Intercept)         x:za         x:zb  
#     0.1989      -0.1627      -0.7082 

From the names of the coefficients we see that in the 1st specification, z is contrasted, so its 1st level "a" is not dummy encoded, while in the 2nd specification, z is not contrasted and both levels "a" and "b" are dummy encoded. Given that both specifications end up with three coefficients, they are really equivalent (mathematically speaking, the design matrices in the two cases have the same column space), and you can verify this by comparing their fitted values:

all.equal(fit1$fitted, fit2$fitted)
# [1] TRUE
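
Another way to check the column-space claim is to compare ranks directly (a small sketch reusing x and z from above):

X1 <- model.matrix(~ x + x:z)
X2 <- model.matrix(~ x:z)
qr(X1)$rank == qr(X2)$rank              ## TRUE: both have rank 3
qr(cbind(X1, X2))$rank == qr(X1)$rank   ## TRUE: stacking them adds no new directions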

So why is z contrasted in the first case? Because otherwise we get two dummy columns for x:z, and the sum of these two columns is just x, aliased with the existing model term x in the formula. In fact, in this case, even if you ask for no contrasts, model.matrix.default will not obey:

model.matrix.default(y ~ x + x:z,
      contrast.arg = list(z = contr.treatment(nlevels(z), contrasts = FALSE)))
#   (Intercept)          x       x:zb
#1            1  0.7635935  0.0000000
#2            1 -0.7990092  0.0000000
#3            1 -1.1476570  0.0000000
#4            1 -0.2894616  0.0000000
#5            1 -0.2992151  0.0000000
#6            1 -0.4115108 -0.4115108
#7            1  0.2522234  0.2522234
#8            1 -0.8919211 -0.8919211
#9            1  0.4356833  0.4356833
#10           1 -1.2375384 -1.2375384

So why is z not contrasted in the 2nd case? Because if it were, we would lose the effect of level "a" when constructing the interaction. And even if you request a contrast, model.matrix.default will just ignore you:

model.matrix.default(y ~ x:z,
      contrast.arg = list(z = contr.treatment(nlevels(z), contrasts = TRUE)))
#   (Intercept)       x:za       x:zb
#1            1  0.7635935  0.0000000
#2            1 -0.7990092  0.0000000
#3            1 -1.1476570  0.0000000
#4            1 -0.2894616  0.0000000
#5            1 -0.2992151  0.0000000
#6            1  0.0000000 -0.4115108
#7            1  0.0000000  0.2522234
#8            1  0.0000000 -0.8919211
#9            1  0.0000000  0.4356833
#10           1  0.0000000 -1.2375384

Oh, amazing model.matrix.default. It is able to make the right decision!


Interaction between two factors

Let me reiterate: there is no way to drop a main effect when the interaction is present.

I will not provide an extra example here, as I have one in "Why do I get NA coefficients and how does lm drop reference level for interaction?". See the "Contrasts for interaction" section over there. In short, all of the following specifications give the same model (they have the same fitted values):

~ year:treatment
~ year:treatment + 0
~ year + year:treatment
~ treatment + year:treatment
~ year + treatment + year:treatment
~ year * treatment

And in particular, the 1st specification leads to an NA coefficient.

So once the RHS of ~ contains the interaction year:treatment, you can never ask model.matrix.default to drop the main effects.

People inexperienced with this behavior are often surprised when producing ANOVA tables; the sketch below makes the point concrete.
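
Here is a minimal sketch with made-up data (year and treatment are hypothetical two-level factors, y2 a fresh response, so nothing from the earlier example is overwritten) verifying these claims: all specifications fit the same model, the interaction-only form yields an NA coefficient, and its ANOVA table charges every degree of freedom to the single year:treatment row:

set.seed(1)
y2 <- rnorm(8)
year      <- gl(2, 4, labels = c("2001", "2002"))
treatment <- gl(2, 2, 8, labels = c("ctl", "trt"))
f1 <- lm(y2 ~ year:treatment)        ## "interaction only"
f2 <- lm(y2 ~ year * treatment)      ## full factorial
all.equal(fitted(f1), fitted(f2))    ## TRUE: the same model
coef(f1)                             ## contains one NA (rank-deficient parametrization)
anova(f1)                            ## a single `year:treatment` row with 3 Df, not 1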


Bypassing model.matrix.default

Some people consider model.matrix.default annoying, as it does not appear to have a consistent manner of dummy encoding. A "consistent manner", in their view, is to always drop the 1st factor level. Well, no problem: you can bypass model.matrix.default by doing the dummy encoding manually and feeding the resulting dummy matrix to lm, etc., as an ordinary variable.

However, you still need model.matrix.default's help to easily do dummy encoding for a (yes, only one) factor variable. For example, for the variable z in our previous example, its full dummy encoding is the following, and you can retain all or some of its columns for regression.

Z <- model.matrix.default(~ z + 0)  ## no contrasts (as there is no intercept)
#   za zb
#1   1  0
#2   1  0
#3   1  0
#4   1  0
#5   1  0
#6   0  1
#7   0  1
#8   0  1
#9   0  1
#10  0  1
#attr(,"assign")
#[1] 1 1
#attr(,"contrasts")
#attr(,"contrasts")$z
#[1] "contr.treatment"

Back to our simple example, if we don't want contrasts for z in y ~ x + x:z, we can do

Z2 <- Z[, 1:2]  ## use "[" to remove attributes of `Z`
lm(y ~ x + x:Z2)
#Coefficients:
#(Intercept)            x       x:Z2za       x:Z2zb  
#     0.1989      -0.7082       0.5456           NA
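
That NA is no accident: the two interaction columns add up to x exactly. A quick check (as a sketch):

all.equal(unname(x * Z2[, 1] + x * Z2[, 2]), x)   ## TRUE: the two interaction columns sum to x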

Not surprisingly, then, we see an NA: the x:Z2zb column is aliased with x, so lm must drop it. And if we want to enforce contrasts in y ~ x:z, we can do either of the following:

Z1 <- Z[, 1]
lm(y ~ x:Z1)
#Coefficients:
#(Intercept)         x:Z1  
#    0.34728     -0.06571

Z1 <- Z[, 2]
lm(y ~ x:Z1)
#Coefficients:
#(Intercept)         x:Z1  
#     0.2318      -0.6860  

And the latter case is probably what contefranz is trying to do.

However, I do not really recommend this kind of hacking. When you pass a model formula to lm, etc., model.matrix.default tries to give you the most sensible construction. Also, in reality we want to do prediction with a fitted model. If you have done the dummy encoding yourself, you will have a hard time providing newdata to predict.
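
By contrast, the formula interface reconstructs the factor coding of new data for you. A small sketch (reusing x and z from above; newd is just an illustrative name):

fit  <- lm(y ~ x + x:z)
newd <- data.frame(x = c(0.5, -0.5),
                   z = factor(c("a", "b"), levels = levels(z)))
predict(fit, newdata = newd)   ## the dummy coding of `z` is rebuilt automatically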
