如何指定lm模型矩阵 [英] How to specify lm model matrix

查看:130
本文介绍了如何指定lm模型矩阵的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有从2组中获得的测量值,其中每组具有相同的3个水平. 这是我的示例data.frame:

df <- data.frame(measurement = c(rnorm(10,1,1),rnorm(10,0.75,1),rnorm(10,1.25,1),
                rnorm(10,0.5,1),rnorm(10,1.75,1),rnorm(10,0.25,1)),
                group = as.factor(c(rep("a",30),rep("b",30))),
                level = as.factor(rep(c(rep("x",10),rep("y",10),rep("z",10)),2)))

我有兴趣量化每个level中的measurementgroup的影响.

我猜想线性模型(lm)是实现此目的的合适方法,其中group:level交互项捕获了我感兴趣的效果.

是否有一种方法可以指定仅计算以下交互项的lm:groupb:levelxgroupb:levelygroupb:levelz?我相信这可以告诉我每个level如何受group"b"(相对于group"a")的影响,我认为这是我感兴趣的.

离我最近的是:

lm(measurement ~ 0 + group * level - group, data = df)

但这仍然可以计算levelxlevelylevelz的效果,而我对此并不感兴趣.

解决方案

就像上面提到的@Lyzander一样,您应该多澄清一些想要的内容.根据您所说的对于每个级别,测量如何受到"b"组(相对于"a"组的影响)",我猜有3种简单的方法可以做到这一点.

df <- data.frame(measurement = c(rnorm(10,1,1),rnorm(10,0.75,1),rnorm(10,1.25,1),
                                 rnorm(10,0.5,1),rnorm(10,1.75,1),rnorm(10,0.25,1)),
                 group = as.factor(c(rep("a",30),rep("b",30))),
                 level = as.factor(rep(c(rep("x",10),rep("y",10),rep("z",10)),2)))


library(dplyr)

#### calculate stats (mean values) ---------------------------------------------
df %>% group_by(level, group) %>% summarise(MeanMeasurement = mean(measurement))

#     level  group MeanMeasurement
#    (fctr) (fctr)           (dbl)
# 1      x      a       1.6708659
# 2      x      b       0.8487751
# 3      y      a       0.7977769
# 4      y      b       1.4209206
# 5      z      a       1.5484668
# 6      z      b      -0.3244225

#### build a model for each level ---------------------------------------------
summary(lm(measurement ~ group  , data = df[df$level=="x",]))

# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   1.6709     0.3174   5.264 5.27e-05 ***
#   groupb       -0.8221     0.4489  -1.831   0.0837 .


summary(lm(measurement ~ group  , data = df[df$level=="y",]))

# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)   
# (Intercept)   0.7978     0.2565   3.111  0.00604 **
#   groupb        0.6231     0.3627   1.718  0.10295


summary(lm(measurement ~ group  , data = df[df$level=="z",]))

# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   1.5485     0.3549   4.363 0.000375 ***
#   groupb       -1.8729     0.5019  -3.731 0.001528 **



## build a model only with interactions ------------------------------------------
summary(lm(measurement ~ group : level , data = df))

# Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    -0.3244     0.3123  -1.039 0.303452    
# groupa:levelx   1.9953     0.4416   4.518 3.43e-05 ***
#   groupb:levelx   1.1732     0.4416   2.657 0.010354 *  
#   groupa:levely   1.1222     0.4416   2.541 0.013951 *  
#   groupb:levely   1.7453     0.4416   3.952 0.000227 ***
#   groupa:levelz   1.8729     0.4416   4.241 8.76e-05 ***
#   groupb:levelz       NA         NA      NA       NA 

如果您查看统计信息(第一种方法)和模型系数,您会发现所有这三种方法都彼此一致.

我将采用第二种方法,因为它是唯一一种可以提供有关levelgroup(a与b)的差异在统计上是否显着的信息.第一种方法只是报告手段.第三种方法包括p值,但它们对应于具有基线交互作用值的比较,而不对应于a组和b组之间的比较.

您提到仅计算这些交互项:groupb:levelx,groupb:levely和groupb:levelz",这意味着您将不会获得a和x,y,z的其他3种交互.换句话说,您强制模型包含这3个交互.

您可以像这样手动进行

df <- data.frame(measurement = c(rnorm(10,1,1),rnorm(10,0.75,1),rnorm(10,1.25,1),
                                 rnorm(10,0.5,1),rnorm(10,1.75,1),rnorm(10,0.25,1)),
                 group = as.factor(c(rep("a",30),rep("b",30))),
                 level = as.factor(rep(c(rep("x",10),rep("y",10),rep("z",10)),2)))

library(dplyr)

df %>%
  mutate(interactions = paste0(group,":",level),
         interactions = ifelse(group=="a","a",interactions)) -> df2

summary(lm(measurement ~ interactions, data = df2))

# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)       0.9318     0.1831   5.089 4.36e-06 ***
#   interactionsb:x  -0.7803     0.3662  -2.131  0.03752 *  
#   interactionsb:y   0.2747     0.3662   0.750  0.45638    
# interactionsb:z  -1.1367     0.3662  -3.104  0.00299 ** 

,但是现在将其他3个交互合并在一起,并且每次将3个交互(b:x,b:y,b:z)中的每个交互与一般组a进行比较时.您没有在x,y和z中比较a与b,但在b组中比较了x与y与z.

I have measurements obtained from 2 groups where each group has the same 3 levels. Here's my example data.frame:

df <- data.frame(measurement = c(rnorm(10,1,1),rnorm(10,0.75,1),rnorm(10,1.25,1),
                rnorm(10,0.5,1),rnorm(10,1.75,1),rnorm(10,0.25,1)),
                group = as.factor(c(rep("a",30),rep("b",30))),
                level = as.factor(rep(c(rep("x",10),rep("y",10),rep("z",10)),2)))

I'm interested in quantifying how measurement in each level is affected by group.

I guess a linear model (lm) is the appropriate approach for this, where the group:level interaction terms capture the effects I'm interested in.

Is there a way to specify an lm that will only compute these interaction terms: groupb:levelx, groupb:levely, and groupb:levelz? I believe this tells me how each level is affected by group "b" (relative to group "a"), which I think is what I'm interested in.

The closest I got is:

lm(measurement ~ 0 + group * level - group, data = df)

But that still computes the effects of levelx, levely, and levelz, which I'm not interested in.

解决方案

As @Lyzander mentioned above you should clarify a bit more what you want. Based on what you said "how measurement is affected by group "b" (relative to group "a") for each level", I guess there are 3 simple ways to do that.

df <- data.frame(measurement = c(rnorm(10,1,1),rnorm(10,0.75,1),rnorm(10,1.25,1),
                                 rnorm(10,0.5,1),rnorm(10,1.75,1),rnorm(10,0.25,1)),
                 group = as.factor(c(rep("a",30),rep("b",30))),
                 level = as.factor(rep(c(rep("x",10),rep("y",10),rep("z",10)),2)))


library(dplyr)

#### calculate stats (mean values) ---------------------------------------------
df %>% group_by(level, group) %>% summarise(MeanMeasurement = mean(measurement))

#     level  group MeanMeasurement
#    (fctr) (fctr)           (dbl)
# 1      x      a       1.6708659
# 2      x      b       0.8487751
# 3      y      a       0.7977769
# 4      y      b       1.4209206
# 5      z      a       1.5484668
# 6      z      b      -0.3244225

#### build a model for each level ---------------------------------------------
summary(lm(measurement ~ group  , data = df[df$level=="x",]))

# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   1.6709     0.3174   5.264 5.27e-05 ***
#   groupb       -0.8221     0.4489  -1.831   0.0837 .


summary(lm(measurement ~ group  , data = df[df$level=="y",]))

# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)   
# (Intercept)   0.7978     0.2565   3.111  0.00604 **
#   groupb        0.6231     0.3627   1.718  0.10295


summary(lm(measurement ~ group  , data = df[df$level=="z",]))

# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   1.5485     0.3549   4.363 0.000375 ***
#   groupb       -1.8729     0.5019  -3.731 0.001528 **



## build a model only with interactions ------------------------------------------
summary(lm(measurement ~ group : level , data = df))

# Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    -0.3244     0.3123  -1.039 0.303452    
# groupa:levelx   1.9953     0.4416   4.518 3.43e-05 ***
#   groupb:levelx   1.1732     0.4416   2.657 0.010354 *  
#   groupa:levely   1.1222     0.4416   2.541 0.013951 *  
#   groupb:levely   1.7453     0.4416   3.952 0.000227 ***
#   groupa:levelz   1.8729     0.4416   4.241 8.76e-05 ***
#   groupb:levelz       NA         NA      NA       NA 

If you check the stats (1st approach) and the models' coefficients you'll see that all those 3 approaches agree with each other.

I'd go with the 2nd approach as it is the only one that gives you info about whether a difference of group (a vs b) within a level is statistically significant or not. 1st approach just reports means. 3rd approach includes p values but they correspond to comparisons with a baseline interaction value and not to comparisons between groups a and b.

You mentioned "ONLY compute these interaction terms: groupb:levelx, groupb:levely, and groupb:levelz" which means that you won't get the other 3 interactions of a and x,y,z. In other words you force your model to include those 3 interactions.

You can do that manually like this

df <- data.frame(measurement = c(rnorm(10,1,1),rnorm(10,0.75,1),rnorm(10,1.25,1),
                                 rnorm(10,0.5,1),rnorm(10,1.75,1),rnorm(10,0.25,1)),
                 group = as.factor(c(rep("a",30),rep("b",30))),
                 level = as.factor(rep(c(rep("x",10),rep("y",10),rep("z",10)),2)))

library(dplyr)

df %>%
  mutate(interactions = paste0(group,":",level),
         interactions = ifelse(group=="a","a",interactions)) -> df2

summary(lm(measurement ~ interactions, data = df2))

# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)       0.9318     0.1831   5.089 4.36e-06 ***
#   interactionsb:x  -0.7803     0.3662  -2.131  0.03752 *  
#   interactionsb:y   0.2747     0.3662   0.750  0.45638    
# interactionsb:z  -1.1367     0.3662  -3.104  0.00299 ** 

but now the other 3 interactions are combined and every time you compare each one of your 3 interactions (b:x, b:y, b:z) with the general group a. You don't compare a vs. b within x,y and z but you compare x vs. y vs. z within group b.

这篇关于如何指定lm模型矩阵的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆