model.matrix(): why do I lose control of contrasts in this case?

Question

Suppose we have a toy data frame:

x <- data.frame(x1 = gl(3, 2, labels = letters[1:3]),
                x2 = gl(3, 2, labels = LETTERS[1:3]))

I would like to construct a model matrix:

#    x1b x1c x2B x2C
# 1    0   0   0   0
# 2    0   0   0   0
# 3    1   0   1   0
# 4    1   0   1   0
# 5    0   1   0   1
# 6    0   1   0   1

by:

model.matrix(~ x1 + x2 - 1, data = x,
             contrasts.arg = list(x1 = contr.treatment(letters[1:3]),
                                  x2 = contr.treatment(LETTERS[1:3])))

But I actually get:

#   x1a x1b x1c x2B x2C
# 1   1   0   0   0   0
# 2   1   0   0   0   0
# 3   0   1   0   1   0
# 4   0   1   0   1   0
# 5   0   0   1   0   1
# 6   0   0   1   0   1
# attr(,"assign")
# [1] 1 1 1 2 2
# attr(,"contrasts")
# attr(,"contrasts")$x1
#   b c
# a 0 0
# b 1 0
# c 0 1

# attr(,"contrasts")$x2
#   B C
# A 0 0
# B 1 0
# C 0 1

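I am somewhat confused:

  • I have passed in explicit contrast matrices to drop the first factor level;
  • I have asked for the intercept to be dropped.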

Then why am I getting a model matrix with 5 columns? How can I get the model matrix I want?

Recommended answer

Whenever we lose control of something at the R level, there must be some default, unchangeable behaviour at the C level. The C code for model.matrix.default() can be found in the R source package at:

R-<release_number>/src/library/stats/src/model.c

We can find the explanation here:

/* If there is no intercept we look through the factor pattern */
/* matrix and adjust the code for the first factor found so that */
/* it will be coded by dummy variables rather than contrasts. */

Let's make a small test of this, with a data frame:

x <- data.frame(x1 = gl(2, 2, labels = letters[1:2]), x2 = sin(1:4))

  • if we only have x2 on the RHS, we can drop the intercept successfully:

model.matrix(~ x2 - 1, data = x)
#          x2
#1  0.8414710
#2  0.9092974
#3  0.1411200
#4 -0.7568025

  • if we only have x1 on the RHS, contrasts are not applied:

    model.matrix(~ x1 - 1, data = x)
    #  x1a x1b
    #1   1   0
    #2   1   0
    #3   0   1
    #4   0   1
    

  • when we have both x1 and x2, contrasts are still not applied to x1:

    model.matrix(~ x1 + x2 - 1, data = x)
    #  x1a x1b         x2
    #1   1   0  0.8414710
    #2   1   0  0.9092974
    #3   0   1  0.1411200
    #4   0   1 -0.7568025
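
The same check can be run on the three-level data frame from the question. The sketch below is only an illustration (the names toy, m12 and m21 are made up here): it swaps the order of the two factors in the formula and looks at which one ends up with a full set of dummy columns.

```r
# Toy data from the question, stored under a hypothetical name
toy <- data.frame(x1 = gl(3, 2, labels = letters[1:3]),
                  x2 = gl(3, 2, labels = LETTERS[1:3]))

# No intercept: the first factor in the formula is coded by dummy
# variables, the second one by (default treatment) contrasts
m12 <- model.matrix(~ x1 + x2 - 1, data = toy)
m21 <- model.matrix(~ x2 + x1 - 1, data = toy)

colnames(m12)  # "x1a" "x1b" "x1c" "x2B" "x2C"
colnames(m21)  # "x2A" "x2B" "x2C" "x1b" "x1c"
```

Whichever factor comes first in the formula gets one column per level, which is consistent with the comment in model.c quoted above.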
    

  • This implies that, while there is a difference between:

    lm(y ~ x2, data = x)
    lm(y ~ x2 - 1, data = x)
    

    there is no difference in the fitted values (only in how the coefficients are parameterised) between:

    lm(y ~ x1, data = x)
    lm(y ~ x1 - 1, data = x)
    

    lm(y ~ x1 + x2, data = x)
    lm(y ~ x1 + x2 - 1, data = x)
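
A quick way to convince yourself of this is to compare fitted values. The data frame above has no response, so the sketch below simulates one; the variable y, the seed, and the object names are assumptions made purely for illustration.

```r
# Same predictors as in the small test, plus a simulated response
set.seed(1)
d <- data.frame(x1 = gl(2, 2, labels = letters[1:2]), x2 = sin(1:4))
d$y <- rnorm(4)  # hypothetical response, not part of the original post

# Factor on the RHS: "- 1" only re-parameterises the model
fit_with    <- lm(y ~ x1, data = d)
fit_without <- lm(y ~ x1 - 1, data = d)
all.equal(fitted(fit_with), fitted(fit_without))

# Numeric covariate only: "- 1" really changes the model
num_with    <- lm(y ~ x2, data = d)
num_without <- lm(y ~ x2 - 1, data = d)
all.equal(fitted(num_with), fitted(num_without))

# The coefficients differ in meaning, not in the fit they produce
coef(fit_with)     # (Intercept) = level a, x1b = b minus a
coef(fit_without)  # x1a = level a, x1b = level b
```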
    


    The reason for such behaviour is not to ensure numerical stability, but to keep estimation / prediction sensible. If we really dropped the intercept while applying contrasts to x1, we would end up with a model matrix:

        #  x1b
        #1   0
        #2   0
        #3   1
        #4   1
    

    The effect is that we constrain the estimate for level a to 0.
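
To see the constraint in action, one can build that single-column matrix by hand and fit it with lm.fit(). This is only a sketch: the response y is simulated (with mean 5, so that 0 is clearly a poor prediction), and the object names are invented for the example.

```r
set.seed(1)
d <- data.frame(x1 = gl(2, 2, labels = letters[1:2]))
d$y <- rnorm(4, mean = 5)  # hypothetical response centred away from 0

# Contrast-coded x1 with the intercept column removed by hand
X <- model.matrix(~ x1, data = d)[, "x1b", drop = FALSE]

fit  <- lm.fit(X, d$y)
pred <- drop(X %*% fit$coefficients)

pred  # rows of level a are predicted as exactly 0
```

Level b is still estimated by its group mean, but level a has no free parameter left, so its prediction is 0 whatever the data say.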

    In this post: "How can I force dropping intercept or equivalent in this linear model?", we have a dataset:

    #           Y    X1    X2
    #1  1.8376852  TRUE  TRUE
    #2 -2.1173739  TRUE FALSE
    #3  1.3054450 FALSE  TRUE
    #4 -0.3476706  TRUE FALSE
    #5  1.3219099 FALSE  TRUE
    #6  0.6781750 FALSE  TRUE
    

    The combination (X1 = FALSE, X2 = FALSE) does not occur in this dataset. But in a broad sense, model.matrix() has to do something safe and sensible. It would be biased to assume that, because a combination of two factor levels is absent from the training dataset, it never needs to be predicted. If we really dropped the intercept while applying contrasts, the prediction for that combination would be constrained to 0. However, the OP of that post deliberately wants such non-standard behaviour (for some reason), in which case a possible workaround was given in my answer there.
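
As for the 4-column matrix asked for in the question, one common approach (not necessarily the workaround from the linked answer) is to build the model matrix with the intercept, so that every factor is contrast-coded, and then drop the intercept column afterwards. A minimal sketch, with toy, mm and wanted as illustrative names:

```r
toy <- data.frame(x1 = gl(3, 2, labels = letters[1:3]),
                  x2 = gl(3, 2, labels = LETTERS[1:3]))

# With an intercept, both factors are coded by treatment contrasts
mm <- model.matrix(~ x1 + x2, data = toy)

# Remove the intercept column after the fact
wanted <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]
wanted
```

Note that subsetting drops the "assign" and "contrasts" attributes, and that any model fitted on this matrix carries the constraint discussed above: the combination of baseline levels (a, A) is predicted as 0.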
