How do regression models deal with factor variables?


Question

Suppose I have data with a factor and a response variable. My questions:

  • How do linear regression and mixed effects models work with factor variables?
  • If I have a separate model for each level of the factor variable (m3 and m4), how does that differ from models m1 and m2?
  • Which one is the best model/approach?

As an example, I use the Orthodont data from the nlme package.

library(nlme)
data = Orthodont
data2 <- subset(data, Sex=="Male")
data3 <- subset(data, Sex=="Female")

m1 <- lm (distance ~ age + Sex, data = Orthodont) 
m2 <- lme(distance ~ age , data = Orthodont, random = ~ 1|Sex)

m3 <- lm(distance ~ age, data = data2)
m4 <- lm(distance ~ age, data= data3)

Recommended answer

Q1: How do linear regression and mixed effects models work with factor variables?
A1: Factors are coded as dummy variables (1 = true, 0 = false).
For example, model 1's coefficients are:

coef(m1)    #lm( distance ~ age + Sex)
#(Intercept)         age   SexFemale 
# 17.7067130   0.6601852  -2.3210227 

The predicted distance is therefore:
Distance = 17.71 + 0.66*age - 2.32*SexFemale
where SexFemale is 0 for males and 1 for females. This simplifies to:
Male:   Distance = 17.71 + 0.66*age
Female: Distance = 15.39 + 0.66*age
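The dummy coding can be inspected directly. The following is a minimal sketch, assuming the nlme package is installed; the names ortho, X, by_hand and via_predict are introduced here for illustration only. It prints the design matrix that lm() builds and checks a hand-computed prediction against predict().

```r
library(nlme)  # provides the Orthodont data

# Drop the groupedData classes so we work with a plain data frame
ortho <- as.data.frame(Orthodont)
m1 <- lm(distance ~ age + Sex, data = ortho)

# The design matrix contains a 0/1 column for the factor
X <- model.matrix(m1)
colnames(X)
head(X, 3)

# Prediction by hand for a 10-year-old female (SexFemale = 1)
b <- coef(m1)
by_hand <- b["(Intercept)"] + b["age"] * 10 + b["SexFemale"] * 1

# The same prediction through predict()
new_obs <- data.frame(age = 10,
                      Sex = factor("Female", levels = levels(ortho$Sex)))
via_predict <- predict(m1, newdata = new_obs)

all.equal(unname(by_hand), unname(via_predict))
```

For a male, the SexFemale term is multiplied by 0 and drops out, which is why the two lines above differ only in their intercept.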

If the factor has more categories (e.g. overweight, healthy, underweight), dummy variables are added accordingly:
R code: lm(distance ~ age + weightStatus)
Computation: Distance = b0 + b1*age + b2*weightIsOver + b3*weightIsUnder
With R's default treatment contrasts, one level (for example, healthy) is the reference and is absorbed into the intercept b0. Each remaining level gets its own coefficient, which is multiplied by 0 or 1 depending on an individual's weight type.
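A small sketch of how R expands a three-level factor under its default treatment contrasts. The weightStatus values and ages below are made up purely for illustration, since the Orthodont data has no such variable.

```r
# Hypothetical toy data: four individuals, three weight categories
weightStatus <- factor(c("healthy", "overweight", "underweight", "healthy"))
age <- c(8, 10, 12, 14)

# Default treatment contrasts: the first level ("healthy") is the reference,
# so only two dummy columns are created for the three levels
X <- model.matrix(~ age + weightStatus)
colnames(X)
X
```

Rows for healthy individuals have 0 in both dummy columns, so their prediction uses only the intercept and the age term. The reference level can be changed with relevel() if a different baseline is more natural.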

Q2: If I have a separate model for each level of the factor variable (m3 and m4), how does that differ from models m1 and m2?
A2: The slopes and intercepts change depending on your model.
m1 is a multiple linear regression (MLR) in which the intercept changes with sex but the slope for age is shared. Sex is a fixed effect here, so this is a varying-intercept model rather than a random-slopes model. The linear mixed effects (LME) model m2 also specifies an intercept that varies by sex (1|Sex), but treats that variation as a random effect.
m3 and m4 behave like a model with both sex-specific intercepts and sex-specific slopes, because each is fitted to a separate subset of the data.
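One way to see what the separate fits m3 and m4 amount to is to compare them with a single linear model that includes an age-by-Sex interaction. This interaction model is not part of the original post; it is added here as a sketch, and the names m_int, male_line and female_line are illustrative.

```r
library(nlme)  # provides the Orthodont data

ortho <- as.data.frame(Orthodont)
m3 <- lm(distance ~ age, data = subset(ortho, Sex == "Male"))
m4 <- lm(distance ~ age, data = subset(ortho, Sex == "Female"))

# Single model with sex-specific intercepts and slopes
m_int <- lm(distance ~ age * Sex, data = ortho)
b <- coef(m_int)

# Reassemble each sex's regression line from the interaction model
male_line   <- c(b["(Intercept)"], b["age"])
female_line <- c(b["(Intercept)"] + b["SexFemale"],
                 b["age"] + b["age:SexFemale"])

all.equal(unname(male_line),   unname(coef(m3)))
all.equal(unname(female_line), unname(coef(m4)))
```

The fitted lines coincide, but the interaction model pools the residual variance across both sexes, so standard errors and tests can differ from those of the two separate fits.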

Let's specify an LME with random slopes and random intercepts:

m2a <- lme(distance ~ age, data = Orthodont, random= ~ age | Sex,
            control = lmeControl(opt="optim"))  
            #Changed the optimizer to achieve convergence

Combining the coefficients allows us to examine how the models are structured:

#Combine the model coefficients
coefs <- rbind(
                coef(m1)[1:2],                     
                coef(m1)[1:2] + c(coef(m1)[3], 0), #female coefficient added to intercept
                coef(m2),
                coef(m2a),
                coef(m3),
                coef(m4)); names(coefs) <- c("intercept", "age")
model.coefs <- data.frame(
                   model = paste0("m", c(1,1,2,2,"2a", "2a",3,4)),
                   type  = rep(c("MLR", "LME randomIntercept", "LME randomSlopes", 
                                  "separate LM"), each=2),
                   Sex = rep(c("male","female"), 4), 
                   coefs, row.names = 1:8)

model.coefs
#  model                type    Sex intercept       age  #intercept & slope 
#1    m1                 MLR   male  17.70671 0.6601852  #different   same 
#2    m1                 MLR female  15.38569 0.6601852  
#3    m2 LME randomIntercept   male  17.67197 0.6601852  #different   same
#4    m2 LME randomIntercept female  15.43622 0.6601852 
#5   m2a    LME randomSlopes   male  16.65625 0.7540780  #different  different
#6   m2a    LME randomSlopes female  16.91363 0.5236138
#7    m3         separate LM   male  16.34062 0.7843750  #different  different
#8    m4         separate LM female  17.37273 0.4795455

Q3: Which one is the best model/approach?
A3: It depends on the situation, but probably a mixed effects model.

In your example, m3 and m4 have no relation to each other and inherently have different slopes for each Sex. The LME models can be compared to determine whether random slopes are warranted (e.g. anova(m2, m2a)). Mixed effects models are versatile when you have multiple levels (e.g. students within classes within schools) and repeated measures (several measures on the same Subject or across Time). You can also specify covariance structures with these models.
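As a sketch of the repeated-measures case, the Orthodont data contains four measurements per child, so Subject is a natural grouping factor. The model below is not one of the models in the question; it is an illustrative assumption about how such a fit could look with nlme, and m_subj is a name introduced here.

```r
library(nlme)

# Random intercept for each child; age and Sex remain fixed effects
m_subj <- lme(distance ~ age + Sex, data = Orthodont,
              random = ~ 1 | Subject)

fixef(m_subj)        # population-level intercept, age slope, and Sex shift
head(ranef(m_subj))  # each child's deviation from the population intercept
```

Here Sex keeps its dummy-variable interpretation as a fixed effect, while the random intercept accounts for the correlation among measurements taken on the same child.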

To visualize these different models, let's look at the Orthodont data:

library(ggplot2)
gg <- ggplot(Orthodont, aes(age, distance, fill=Sex)) + theme_bw() +
        geom_point(shape=21, position= position_dodge(width=0.2)) +  
        #`fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
        stat_summary(fun = "mean", geom="point", size=8, shape=22, colour="black" ) +
        scale_fill_manual(values = c("Male" = "black", "Female" = "white"))

Circles = raw data, squares = means. Distance appears to increase linearly with age. Males have higher distances than females. The slopes may vary by sex too, with females showing a smaller increase in distance with age than males. (Note: raw data have been slightly dodged on the x-axis to avoid overplotting.)

Adding our models to the data and zooming in:

gg1 <- gg +  
            geom_abline(data = model.coefs, size=1.5,
               aes(slope = age, intercept = intercept, colour = type, linetype = Sex)) 
gg1 + coord_cartesian(ylim = c(21, 27)) #zoom in

Here, we see that the LME model with random intercepts resembles the MLR model. The LME with random intercepts and random slopes resembles the separate LMs on the subsetted data.

Finally, here is how to fit the equivalent of m2 using the lme4 package:

m2 <- lme(distance ~ age , data = Orthodont, random = ~ 1|Sex)
library(lme4)
m5 <- lmer(distance ~ age + (1|Sex), data = Orthodont)  #same as m2

More resources:
(Generalized) Linear Mixed Models FAQ
Comparing nlme and lme4 using Orthodont data.
