predict and model.matrix give different predicted means within levels of a factor variable

Problem description

This question arose as a result of another question posted here: "non-conformable arguments error from lmer when trying to extract information from the model matrix".

When trying to obtain predicted means from an lmer model containing a factor variable, the output varies depending on how the factor variable is specified.

I have a variable agegroup, which can be specified using the groups "Children <15 years", "Adults 15-49 years", "Elderly 50+ years" or "0-15y", "15-49y", "50+y". My choice matters because for the former, the alphabetical ordering of the labels differs from the numeric ordering of the levels. To illustrate this, I have again used the sleep data.
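To make the ordering difference concrete, here is a minimal base-R sketch. The vector `age` and its three values are illustrative only and are not the sleep data. It compares the level order that factor() produces from the levels and labels arguments with the alphabetical order of the same labels.

```r
# Illustrative values only; not taken from the sleep data
age  <- c(1, 2, 3)
labs <- c("Children <15 years", "Adults 15-49 years", "Elderly 50+ years")

# Levels follow the order given in `labels`
agegroup1   <- factor(age, levels = c(1, 2, 3), labels = labs)
level_order <- levels(agegroup1)

# Alphabetical order of the same labels puts "Adults" first
alpha_order <- sort(labs)

print(level_order)
print(alpha_order)
print(identical(level_order, alpha_order))
```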

library(lme4)
sleep <- as.data.frame(sleepstudy)   #import the sleep data

I have to create a variable for age.

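set.seed(13)  # set a seed so the new variable, age, is reproducible
# Note: length(sleep) is the number of columns (3), not rows, so only 3 values
# are drawn and then recycled down the rows; nrow(sleep) would draw one per row
sleep$age <- sample(1:3, length(sleep), rep = TRUE)  # create a new variable, age
sleep$agegroup1 <- factor(sleep$age, levels = c(1,2,3), 
        labels = c("Children <15 years", "Adults 15-49 years", "Elderly 50+ years"))
table(sleep$agegroup1)  # should have 3 age groups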

Then I run the model:

m1 <- lmer(Reaction ~ Days + agegroup1 + Days:agegroup1 + (Days | Subject), sleep) 
summary(m1)

# New data frame for predicted means
d <- seq(0,9,1)  # make a vector of days = 0 to 9
newdat1 <- data.frame(Days=d,      
                          agegroup1=factor(rep(levels(sleep$agegroup1),length(d))))
newdat1 <- newdat1[order(newdat1$Days,newdat1$agegroup1),]   #order by Days 
mm <- model.matrix(formula(m1,fixed.only=TRUE)[-2], newdat1)  #create the matrix

Now, I try to output the predicted means using the model matrix and also the predict function:

newdat1$mm <- mm%*%fixef(m1)    
newdat1$predict <- predict(m1, newdata=newdat1, re.form=NA)
head(newdat1)

Here, the predicted means from the model matrix and the predict function are different; the Adults and Children age groups are inverted.

   Days          agegroup1       mm  predict
11    0 Adults 15-49 years 252.2658 252.8241
1     0 Children <15 years 252.8241 252.2658
21    0  Elderly 50+ years 249.1254 249.1254
2     1 Adults 15-49 years 262.3326 263.2674
22    1 Children <15 years 263.2674 262.3326
12    1  Elderly 50+ years 260.0171 260.0171

If I run this script again using factor labels for which the alphabetical ordering is the same as the numeric ordering of the levels, I get different results:

#set new labels for agegroup
sleep$agegroup2 <- factor(sleep$age, levels = c(1,2,3), 
                        labels = c("0-15y", "15-49y", "50+y"))
m2 <- lmer(Reaction ~ Days + agegroup2 + Days:agegroup2 + (Days | Subject), sleep) 
summary(m2)

# New data frame for predicted means
d <- seq(0,9,1)  # make a vector of days = 0 to 9
newdat2 <- data.frame(Days=d,
                    agegroup2=factor(rep(levels(sleep$agegroup2),length(d))))
newdat2 <- newdat2[order(newdat2$Days,newdat2$agegroup2),]   #order by Days
mm <- model.matrix(formula(m2,fixed.only=TRUE)[-2], newdat2)
newdat2$mm <- mm%*%fixef(m2)   
newdat2$predict <- predict(m2, newdata=newdat2, re.form=NA)
head(newdat2)

Here, the predicted means from the model matrix and the predict function are the same.

   Days agegroup2       mm  predict
1     0     0-15y 252.2658 252.2658
11    0    15-49y 252.8241 252.8241
21    0      50+y 249.1254 249.1254
22    1     0-15y 262.3326 262.3326
2     1    15-49y 263.2674 263.2674
12    1      50+y 260.0171 260.0171

Predict appears to ignore the labels and focus on the levels, while directly accessing the model-matrix correctly focusses on the labels. My question, then, is whether it is always necessary to ensure that factor levels and labels have the same order when trying to use the model matrix? Or is there some other way to overcome this problem?

Solution

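To calculate predicted values "by hand", the columns of the model matrix must be in the same order as the fixed effects from the model; otherwise the matrix multiplication pairs coefficients with the wrong columns. So yes, the levels of the factor in the new dataset must be in the same order as in the original dataset to use model.matrix and fixef as you did. In newdat1, factor() is applied to a character vector, so the levels are sorted alphabetically ("Adults 15-49 years" first) rather than kept in the original order ("Children <15 years" first), which is why those two groups are swapped in the mm column.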
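One way to see the mismatch is to compare the column names of the two design matrices. The sketch below uses base R only, with small made-up data frames: `orig` and `newdat` are hypothetical stand-ins for sleep and newdat1. With the objects from the question, the equivalent check would be to compare colnames(mm) with names(fixef(m1)).

```r
labs <- c("Children <15 years", "Adults 15-49 years", "Elderly 50+ years")

# Stand-in for the original data: levels ordered Children, Adults, Elderly
orig <- data.frame(Days = rep(0:1, each = 3),
                   agegroup1 = factor(rep(1:3, 2), levels = 1:3, labels = labs))

# Stand-in for the new data: factor() on a character vector sorts the levels
newdat <- data.frame(Days = rep(0:1, each = 3),
                     agegroup1 = factor(rep(labs, 2)))

# Column names of the fixed-effects design matrix under each level order
cols_orig <- colnames(model.matrix(~ Days * agegroup1, orig))
cols_new  <- colnames(model.matrix(~ Days * agegroup1, newdat))

print(cols_orig)
print(cols_new)
print(identical(cols_orig, cols_new))
```

Both matrices have the same number of columns, so the multiplication still runs without an error; it just pairs some coefficients with columns for a different age group.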

You can achieve this by setting the order of the factor levels in your new dataset. This is easiest to do by simply using the levels of the factor from the original dataset. For example, in newdat1 you can do:

factor(rep(levels(sleep$agegroup1), length(d)), levels = levels(sleep$agegroup1))
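As a check that setting the levels resolves the discrepancy, here is a self-contained sketch. To avoid extra packages it fits an ordinary lm() to simulated data instead of the lmer model from the question, so coef() stands in for fixef(), and the names `dat`, `fit`, `bad`, `good` and `by_hand` are all made up for illustration. The level-ordering logic should carry over unchanged.

```r
set.seed(1)
labs <- c("Children <15 years", "Adults 15-49 years", "Elderly 50+ years")

# Simulated stand-in for the sleep data (assumption: not the real data)
dat <- data.frame(Days = rep(0:9, 6), age = rep(1:3, 20))
dat$agegroup1 <- factor(dat$age, levels = 1:3, labels = labs)
dat$Reaction  <- 250 + 10 * dat$Days + 5 * dat$age + rnorm(nrow(dat))

fit <- lm(Reaction ~ Days * agegroup1, data = dat)

# New data with the age group still stored as character
newdat <- expand.grid(Days = 0:9, agegroup1 = labs, stringsAsFactors = FALSE)

# bad: levels sorted alphabetically; good: levels copied from the original data
bad  <- transform(newdat, agegroup1 = factor(agegroup1))
good <- transform(newdat, agegroup1 = factor(agegroup1,
                                             levels = levels(dat$agegroup1)))

# "By hand" prediction: design matrix times the estimated coefficients
rhs     <- delete.response(terms(fit))
by_hand <- function(nd) drop(model.matrix(rhs, nd) %*% coef(fit))

ok_good <- isTRUE(all.equal(unname(by_hand(good)), unname(predict(fit, good))))
ok_bad  <- isTRUE(all.equal(unname(by_hand(bad)),  unname(predict(fit, bad))))

print(ok_good)
print(ok_bad)
```

With the levels copied from the original data, the by-hand values agree with predict(); with alphabetically sorted levels they do not, which mirrors the swapped rows in the question.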
