fitting a linear mixed model to a very large data set
Problem description
I want to run a mixed model (using lme4::lmer) on 60M observations of the following format; all predictor variables are categorical (factors) apart from the continuous dependent variable tc; patient is the grouping variable for a random intercept term. I have 64-bit R and 16Gb RAM and I'm working from a central server. RStudio is the most recent server version.
model <- lmer(tc ~ sex + age + lho + atc + (1|patient),
              data = master, REML = TRUE)
lho sex tc age atc patient
18 M 16.61 45-54 H 628143
7 F 10.52 12-15 G 2013855
30 M 92.73 35-44 N 2657693
19 M 24.92 70-74 G 2420965
12 F 17.44 65-69 A 2833610
31 F 7.03 75 and over A 1090322
3 F 28.59 70-74 A 2718649
29 F 4.09 75 and over C 384578
16 F 67.22 65-69 R 1579355
23 F 7.7 70-74 C 896374
I'm getting a cannot allocate a vector of 25.5Gb error. I'm assigned 40Gb on the server and am using 25, so I guess that means I need another 10 or so. I don't think I can get any extra space assigned.
I don't know the first thing about parallel processing except that I'm using one of four cores at the moment. Can anyone suggest parallel code for this model, or perhaps a different fix?
Recommended answer
As pointed out by Carl Witthoft, the standard parallelization tools in R use a shared memory model, so they will make things worse rather than better (their main purpose is to accelerate compute-bound jobs by using multiple processors).
In the short term, you might be able to save some memory by treating the categorical fixed-effect predictors (age, atc) as random effects but forcing their variances to be large. I don't know if this will be enough to save you or not; it will compress the fixed-effect model matrix a lot, but the model frame will still be stored/replicated with the model object ...
dd1 <- read.table(header=TRUE, stringsAsFactors=TRUE,  ## factors needed for levels() below (R >= 4.0)
text="lho sex tc age atc patient
18 M 16.61 45-54 H 628143
7 F 10.52 12-15 G 2013855
30 M 92.73 35-44 N 2657693
19 M 24.92 70-74 G 2420965
12 F 17.44 65-69 A 2833610
31 F 7.03 75_and_over A 1090322
3 F 28.59 70-74 A 2718649
29 F 4.09 75_and_over C 384578
16 F 67.22 65-69 R 1579355
23 F 7.7 70-74 C 896374")
n <- 1e5
set.seed(101)
dd2 <- with(dd1,
data.frame(tc=rnorm(n,mean=mean(tc),sd=sd(tc)),
lho=round(runif(n,min=min(lho),max=max(lho))),
sex=sample(levels(sex),size=n,replace=TRUE),
age=sample(levels(age),size=n,replace=TRUE),
atc=sample(levels(atc),size=n,replace=TRUE),
patient=sample(1:1000,size=n,replace=TRUE)))
library("lme4")
m1 <- lmer(tc~sex+(1|lho)+(1|age)+(1|atc)+(1|patient),
data=dd2,REML=TRUE)
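The compression comes from how the two kinds of terms are stored: fixed-effect factors are expanded into dense dummy columns of the model matrix, while random-effect terms are held sparsely. A small self-contained illustration (the factor here is simulated; Matrix is a dependency of lme4):

```r
library(Matrix)

set.seed(101)
f <- factor(sample(letters, 1e5, replace = TRUE))  # a 26-level factor

dense  <- model.matrix(~ f)             # dense dummy expansion: 26 columns
sparse <- sparse.model.matrix(~ 0 + f)  # sparse indicators: 1 nonzero per row

length(dense)      # 2.6e6 doubles stored
length(sparse@x)   # 1e5 nonzeros stored -- 26x fewer numbers
```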
Random effects are automatically sorted in order from largest to smallest number of levels. Following the machinery described in the ?modular help page:
lmod <- lFormula(tc~sex+(1|lho)+(1|age)+(1|atc)+(1|patient),
data=dd2,REML=TRUE)
names(lmod$reTrms$cnms) ## ordering
devfun <- do.call(mkLmerDevfun, lmod)
wrapfun <- function(tt, bigsd=1000) {
    ## theta order puts patient first; the remaining three terms
    ## (lho, age, atc) are pinned at the huge relative SD
    devfun(c(tt, rep(bigsd, 3)))
}
wrapfun(1)
opt <- optim(fn=wrapfun,par=1,method="Brent",lower=0,upper=1000)
opt$fval <- opt$value ## rename/copy
res <- mkMerMod(environment(devfun), opt, lmod$reTrms, fr=lmod$fr)
res
You can ignore the reported variances for the categorical terms, and use ranef() to recover their (unshrunk) estimates.
In the long term, the proper way to do this problem is probably to parallelize it with a distributed-memory model. In other words, you would want to parcel the data out in chunks to different servers; use the machinery described in ?modular to set up a likelihood function (actually a REML-criterion function) that gives the REML criterion for a subset of the data as a function of the parameters; then run a central optimizer that takes a set of parameters and evaluates the REML criterion by submitting the parameters to each server, retrieving the values from each server, and adding them. The only two problems I see with implementing this are (1) I don't actually know how to implement distributed-memory computation in R (based on this intro document it seems that the Rmpi/doMPI packages might be the right way to go); (2) in the default way that lmer is implemented, the fixed-effects parameters are profiled out rather than being explicitly part of the parameter vector.
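The central loop of that scheme might look roughly like the sketch below. This is pseudocode under stated assumptions: eval_chunk_on_worker() is hypothetical (each worker would hold one data chunk plus a deviance function built on it with lFormula()/mkLmerDevfun(), as in ?modular), and the data must be chunked so that no patient's rows are split across servers, which is what makes the per-chunk criteria additive.

```r
## Pseudocode sketch only -- eval_chunk_on_worker() is hypothetical.
## It would ship the candidate parameter vector to one remote server
## and return that chunk's REML criterion.
total_criterion <- function(theta, workers) {
  vals <- vapply(workers,
                 function(w) eval_chunk_on_worker(w, theta),
                 numeric(1))
  sum(vals)  ## independent chunks => criteria add
}

opt <- optim(par = theta_start, fn = total_criterion,
             workers = workers, method = "Nelder-Mead")
```

Note that, per problem (2) above, this sketch only optimizes over the variance parameters theta; how to handle the profiled-out fixed effects across chunks is exactly the open issue flagged.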