Generating predictive simulations from a multilevel model with random intercepts

Question

I am struggling to understand how, in R, to generate predictive simulations for new data using a multilevel linear regression model with a single set of random intercepts. Following the example on pp. 146-147 of this text, I can execute this task for a simple linear model with no random effects. What I can't wrap my head around is how to extend the set-up to accommodate random intercepts for a factor added to that model.

I'll use iris and some fake data to show where I'm getting stuck. I'll start with a simple linear model:

mod0 <- lm(Sepal.Length ~ Sepal.Width, data = iris)

Now let's use that model to generate 1,000 predictive simulations for 250 new cases. I'll start by making up those cases:

set.seed(20912)
fakeiris <- data.frame(Sepal.Length = rnorm(250, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
                       Sepal.Width = rnorm(250, mean(iris$Sepal.Width), sd(iris$Sepal.Width)),
                       Species = sample(as.character(unique(iris$Species)), 250, replace = TRUE),
                       stringsAsFactors=FALSE)

Following the example in the aforementioned text, here's what I do to get 1,000 predictive simulations for each of those 250 new cases:

library(arm)
n.sims = 1000  # set number of simulations
n.tilde = nrow(fakeiris)  # set number of cases to simulate
X.tilde <- cbind(rep(1, n.tilde), fakeiris[,"Sepal.Width"])  # create matrix of predictors describing those cases; need column of 1s to multiply by intercept
sim.fakeiris <- sim(mod0, n.sims)  # draw the simulated coefficients
y.tilde <- array(NA, c(n.sims, n.tilde))  # build an array to hold results
for (s in 1:n.sims) { y.tilde[s,] <- rnorm(n.tilde, X.tilde %*% sim.fakeiris@coef[s,], sim.fakeiris@sigma[s]) }  # use matrix multiplication to fill that array

That works fine, and now we can do things like colMeans(y.tilde) to inspect the central tendencies of those simulations, and cor(colMeans(y.tilde), fakeiris$Sepal.Length) to compare them to the (fake) observed values of Sepal.Length.

Now let's try an extension of that simple model in which we assume that the intercept varies across groups of observations (here, species). I'll use lmer() from the lme4 package to estimate a simple multilevel/hierarchical model that matches that description:

library(lme4)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)

Okay, that works, but now what? I run:

sim.fakeiris.lmer <- sim(mod1, n.sims)

When I use str() to inspect the result, I see that it is an object of class sim.merMod with three components:

  • @fixef, a 1,000 x 2 matrix with simulated coefficients for the fixed effects (the intercept and Sepal.Width)

  • @ranef, a 1,000 x 3 matrix with simulated coefficients for the random effects (the three species)

  • @sigma, a vector of length 1,000 containing the sigmas associated with each of those simulations

I can't wrap my head around how to extend the matrix construction and multiplication used for the simple linear model to this situation, which adds another dimension. I looked in the text, but I could only find an example (pp. 272-275) for a single case in a single group (here, species). The real-world task I'm aiming to perform involves running simulations like these for 256 new cases (pro football games) evenly distributed across 32 groups (home teams). I'd greatly appreciate any assistance you can offer.

Addendum. Stupidly, I hadn't looked at the details on simulate.merMod() in lme4 before posting this. I have now. It seems like it should do the trick, but when I run simulate(mod0, nsim = 1000, newdata = fakeiris), the result has only 150 rows. The values look sensible, but there are 250 rows (cases) in fakeiris. Where is that 150 coming from?
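
One likely explanation for the 150 rows: mod0 is the lm() fit, so simulate() dispatches to the lm method, which does not use newdata and returns one row per row of the training data (iris has 150). The sketch below contrasts that with calling simulate() on the lmer() fit. It assumes that Species in the new data is a factor with the training levels and that re.form = NULL (conditioning on the estimated species intercepts) is the behaviour wanted; the names sim.lm and sim.lmer are illustrative only.

```r
library(lme4)

set.seed(20912)
mod0 <- lm(Sepal.Length ~ Sepal.Width, data = iris)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)

# New cases, built as in the question; Species is kept as a factor with the
# training levels so the grouping variable matches the fitted model
fakeiris <- data.frame(
  Sepal.Length = rnorm(250, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
  Sepal.Width  = rnorm(250, mean(iris$Sepal.Width), sd(iris$Sepal.Width)),
  Species      = factor(sample(levels(iris$Species), 250, replace = TRUE),
                        levels = levels(iris$Species))
)

# lm fit: newdata is silently ignored, so the rows follow iris (150 of them)
sim.lm <- simulate(mod0, nsim = 1000, newdata = fakeiris)
dim(sim.lm)

# lmer fit: simulate.merMod uses newdata; re.form = NULL conditions on the
# estimated species intercepts instead of drawing new ones
sim.lmer <- simulate(mod1, nsim = 1000, newdata = fakeiris, re.form = NULL)
dim(sim.lmer)
```

Note that simulate() draws around the point estimates of the coefficients, so unlike sim() from arm it does not propagate uncertainty in the fixed effects.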

Recommended answer

One possibility is to use the predictInterval function from the merTools package. The package is about to be submitted to CRAN, but the current development release is available for download from GitHub:

    install.packages("devtools")
    devtools::install_github("jknowles/merTools")

To get the median and a 95% credible interval of 100 simulations:

    mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)

    out <- predictInterval(mod1, newdata=fakeiris, level=0.95,
                           n.sims=100, stat="median")

By default, predictInterval includes the residual variation, but you can turn that feature off with:

    out2 <- predictInterval(mod1, newdata=fakeiris, level=0.95,
                           n.sims=100, stat="median", 
                           include.resid.var=FALSE)
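
For readers who want to stay with the sim() workflow from the question, the matrix set-up can be extended by adding each case's group intercept to the fixed-effects linear predictor. This is a minimal sketch, not a definitive implementation. It assumes that the @ranef slot of the sim() output is a list with one array per grouping factor, that its entries are deviations from the fixed intercept, and that its group columns follow the order of the factor levels stored in the fitted model (worth confirming with str() on your own object). The names newdat, b.fix, a.grp, and grp.idx are illustrative only.

```r
library(lme4)
library(arm)

set.seed(20912)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)

# New cases: only the predictor and the grouping variable are needed
newdat <- data.frame(
  Sepal.Width = rnorm(250, mean(iris$Sepal.Width), sd(iris$Sepal.Width)),
  Species = sample(levels(iris$Species), 250, replace = TRUE),
  stringsAsFactors = FALSE
)

n.sims <- 1000
n.tilde <- nrow(newdat)
sims <- sim(mod1, n.sims)

X.tilde <- cbind(1, newdat$Sepal.Width)   # column of 1s for the intercept
b.fix <- sims@fixef                       # n.sims x 2 fixed-effect draws

# Assumption: one column of draws per species, in factor-level order;
# matrix() flattens a trailing dimension of length 1 if there is one
a.grp <- matrix(sims@ranef[["Species"]], nrow = n.sims)
grp.levels <- levels(getME(mod1, "flist")$Species)
grp.idx <- match(newdat$Species, grp.levels)  # group column for each new case

y.tilde <- array(NA, c(n.sims, n.tilde))
for (s in 1:n.sims) {
  # fixed part plus the simulated intercept shift of each case's group
  mu <- as.vector(X.tilde %*% b.fix[s, ]) + a.grp[s, grp.idx]
  y.tilde[s, ] <- rnorm(n.tilde, mu, sims@sigma[s])
}
```

As in the simple linear case, colMeans(y.tilde) then summarizes the simulations for each new case. The same indexing should carry over to the 32-team setting, provided every group in the new data also appears in the fitted model.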

Hope this helps!
