Large fixed effects binomial regression in R

Problem description

I need to run a logistic regression on a relatively large data frame with 480,000 entries and 3 fixed effect variables. Fixed effect var A has 3233 levels, var B has 2326 levels, var C has 811 levels. So all in all I have 6370 fixed effects. The data is cross-sectional. I can't run this regression using the normal glm function because the regression matrix seems too large for my memory (I get the message "Error: cannot allocate vector of size 22.9 Gb"). I am looking for alternative ways to run this regression on my MacBook Air (OS X 10.9.5, 8 GB RAM). I also have access to a server with 16 GB RAM.
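For what it's worth, the size in that error message is consistent with the dense model matrix glm would try to build: roughly 480,000 rows by ~6,370 dummy columns of 8-byte doubles. A back-of-the-envelope sketch (ignoring dropped reference levels and any other covariates):

```r
# Rough size of the dense model matrix glm would need to allocate:
rows <- 480000
cols <- 3233 + 2326 + 811    # ~6370 dummy columns for factors A, B, C
rows * cols * 8 / 2^30       # GiB of 8-byte doubles; about 22.8, matching the error
```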

I have tried solving the issue in a few different ways, but so far none has led to satisfactory results:

lfe/felm: Using the felm regression function of the lfe package, which subtracts the fixed effects before running the regression. This works perfectly and allowed me to run the above regression as a normal linear model in just a few minutes. However, lfe does not support logistic regressions or other GLMs. So felm was great for getting an idea of model fit across different models, but it doesn't work for the final logistic regression models.
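For reference, the linear-model run looked roughly like this (a sketch; y, x, A, B, C and df are placeholder names, not the actual columns in the question):

```r
library(lfe)

# Linear probability model: the fixed effects A, B and C are projected out
# (after the "|") instead of being entered as thousands of dummy variables.
fit_lpm <- felm(y ~ x | A + B + C, data = df)
summary(fit_lpm)
```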

biglm/bigglm: I thought about using bigglm to break my function into more manageable chunks. However, several sources (e.g. link1, link2, link3) mention that for that to work, factor levels need to be consistent across chunks, i.e. each chunk must contain at least one observation of each level of each factor variable. Factors A and B contain levels that appear only once, so I can't split the data into chunks with consistent levels. If I delete the 10 such levels of fixed effect A and the 8 of B (a minor change), I will only have levels with 4+ observations left, and splitting my data into 4 chunks would already make it a lot more manageable. However, I still need to figure out how to sort my df so that my 480,000 entries end up in 4 chunks in which every level of each of the 3 factors appears at least once.
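A quick way to test whether a candidate split satisfies that requirement might look like this (a sketch; df, the columns A, B, C, and the round-robin assignment are assumptions):

```r
# Assign rows to 4 chunks and check that every level of every factor
# variable appears in every chunk (the condition reportedly needed by bigglm).
chunk_id <- rep_len(1:4, nrow(df))   # naive round-robin chunk assignment

covers_all_levels <- function(fac, chunks) {
  all(vapply(split(fac, chunks),
             function(s) all(levels(fac) %in% s),
             logical(1)))
}

# TRUE for a factor only if this chunking keeps all its levels in every chunk
vapply(df[c("A", "B", "C")], covers_all_levels, logical(1), chunks = chunk_id)
```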

GlmmGS/glmgs: The glmmgs function in the package of the same name performs a fixed-effects subtraction, like the lfe package, for logistic regressions, using a "Gauss-Seidel" algorithm. Unfortunately, the package is no longer being developed. Being relatively new to R and having no deep background in statistics, I can't make sense of the output and have no idea how to transform it into the usual "effect size", "model fit", and "significance interval" indicators that glm regression summaries provide.

I sent a message to the package's authors. They kindly responded as follows:

The package provides no output in the same format as a glm object. However, you can easily calculate most of the fit statistics (standard error of the estimates, goodness of fit) from the current output (in the CRAN version, I believe the current output is a vector of coefficient estimates and the associated vector of standard errors; the same goes for the covariance components, but you need not worry about them if you are fitting a model without random effects). Only beware that the covariance matrices used to calculate the standard errors are the inverse of the diagonal blocks of the precision matrix associated with the Gauss-Seidel algorithm, so they tend to underestimate the standard errors of the joint likelihood. I am not maintaining the package any longer and I do not have time to get into the specific details; the seminal theory behind the package can be found in the paper referenced in the manual, everything else needs to be worked out by you with pen and paper :).

If anyone can explain how to "easily calculate most of the fit statistics" in a way that someone without any education in statistics can understand (might be impossible), or provide R code that shows an example of how this could be done, I would be much obliged!
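In case it is useful: if the output really is just a vector of coefficient estimates and a vector of standard errors, the standard Wald statistics can be sketched like this (est and se below are made-up placeholder values, not actual glmmGS output):

```r
est <- c(0.80, -1.20)   # hypothetical coefficient estimates (log-odds scale)
se  <- c(0.30,  0.50)   # hypothetical standard errors

z  <- est / se                        # Wald z-statistics
p  <- 2 * pnorm(-abs(z))              # two-sided p-values
ci <- cbind(lower = est - 1.96 * se,  # 95% confidence intervals (log-odds)
            upper = est + 1.96 * se)
exp(est)                              # odds ratios, if those are wanted

round(cbind(est, se, z, p, ci), 3)
```

Note the author's caveat above: the reported standard errors tend to be underestimated, so p-values and intervals computed this way would be optimistic.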

Revolution Analytics: I installed Revolution Analytics Enterprise on a virtual machine that simulates Windows 7 on my Mac. The program has a function called RxLogit that is optimized for large logistic regressions. Using the RxLogit function I get the error (Failed to allocate 326554568 bytes. Error in rxCall("RxLogit", params) : bad allocation), so that function also seems to run into memory issues. However, the software enables me to run my regression on a distributed computing cluster, so I could just "kill the problem" by purchasing computing time on a cluster with lots of memory. But I wonder whether the Revolution Analytics program provides any formulas or methods that I don't know of that would allow me to do some kind of lfe-like fixed-effects subtraction or bigglm-like chunking that takes the factors into account.

MatrixModels/glm4: One person suggested I use the glm4 function of the MatrixModels package with the sparse = TRUE argument to speed up the calculation. If I run a glm4 regression with all fixed effects, I get an "Error in Cholesky(crossprod(from), LDL = FALSE) : internal_chm_factor: Cholesky factorization failed" error. If I run it with only fixed effect variable B, or with A and C, the calculation works and returns a "glpModel" object. As with glmmGS, I have issues turning that output into a form that makes sense to me, since the standard summary() method does not seem to work on it.
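If it helps, the estimates can apparently be dug out of the S4 slots directly. This is a sketch based on my reading of the MatrixModels class layout; fit is a placeholder for the glm4 result, and the slot path is an assumption worth verifying with str(fit):

```r
library(MatrixModels)

# Placeholder fit with a single fixed-effect factor (y, B, df are stand-ins):
# fit <- glm4(y ~ B, family = binomial, data = df, sparse = TRUE)

beta <- fit@pred@coef   # coefficient estimates (assumed slot path in glpModel)
head(beta)
```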

I would be grateful for advice on any of the issues mentioned above, or for completely different approaches to running logistic regressions with multiple large fixed effects in R under memory constraints.

Recommended answer

Check out

glmmboot{glmmML}

http://cran.r-project.org/web/packages/glmmML/glmmML.pdf

There is also a nice document by Brostrom and Holmberg (http://cran.r-project.org/web/packages/eha/vignettes/glmmML.pdf)

Here is the example from their document:

dat <- data.frame(y = rbinom(5000, size = 1, prob = 0.5),
                  x = rnorm(5000), group = rep(1:1000, each = 5))
fit1 <- glm(y ~ factor(group) + x, data = dat, family = binomial)

require(glmmML)
fit2 <- glmmboot(y ~ x, cluster = group, data = dat)

The computing time difference is "huge"!
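On my understanding, the gain comes from glmmboot profiling the cluster intercepts out of the likelihood instead of estimating 1000 dummy coefficients. The difference can be measured directly (placeholder timing code, not from the vignette, using the dat simulated above):

```r
# Compare wall-clock time of the two fits from the example above.
system.time(glm(y ~ factor(group) + x, data = dat, family = binomial))
system.time(glmmboot(y ~ x, cluster = group, data = dat))
```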
