随机拦截GLM [英] Random Intercept GLM

查看:134
本文介绍了随机拦截GLM的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想在R中拟合随机截距互补对数-对数回归,以检查未观察到的用户异质性. 我通过互联网和书籍进行了搜索,但在Stata中仅找到一种解决方案,也许有人可以将其改编成R. 在Stata中,有2个命令可用:

I want to fit a random-intercept complementary log-log regression in R, in order to check for unobserved user heterogeneity. I have searched through the internet and books and have only found one solution in Stata, maybe someone can adapt that to R. In Stata there are 2 commands available:

  1. xtcloglog用于两级随机拦截
  2. gllamm用于随机系数和更高级别的模型
  1. xtcloglog for two-level random intercept
  2. gllamm for random-coefficient and and higher-levels models

我的数据与人的活动是否完成以及受到阳光的影响有关-completion是结果变量,sunshine和下面提到的其他变量是解释变量;这是一个简化的版本.

My data relates if activities from people are completed or not and affected by sunshine - completion is the outcome variable and sunshine and the others mentioned below would be the explanatory variable; this is a simplified version.

    581755 obs. of 68 variables:
     $ activity          : int  37033 37033 37033 37033 37033 37033 37033 37033 37033 37033 ...
     $ people         : int  5272 5272 5272 5272 5272 5272 5272 5272 5272 5272 ...
     $ completion: num 0 0 0 0 0 0 0 0 0 0 ...
     $ active            : int  0 2 2 2 2 2 2 2 2 2 ...
     $ overdue           : int  0 0 0 0 0 0 0 0 0 0 ...
     $ wdsp              : num  5.7 5.7 7.7 6.4 3.9 5.8 3.5 6.3 4.8 9.4 ...
     $ rain              : num  0 0 0 0 0 0 0 0 0 0 ...
     $ UserCompletionRate: num [1:581755, 1] NaN -1.55 -1.55 -1.55 -1.55 ...
      ..- attr(*, "scaled:center")= num 0.462
      ..- attr(*, "scaled:scale")= num 0.298
     $ DayofWeekSu       : num  0 0 0 0 0 1 0 0 0 0 ...
     $ DayofWeekMo       : num  0 0 0 0 0 0 1 0 0 0 ...
     $ DayofWeekTu       : num  1 0 0 0 0 0 0 1 0 0 ...
     $ DayofWeekWe       : num  0 1 0 0 0 0 0 0 1 0 ...
     $ DayofWeekTh       : num  0 0 1 0 0 0 0 0 0 1 ...
     $ DayofWeekFr       : num  0 0 0 1 0 0 0 0 0 0 ...

     $ MonthofYearJan    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearFeb    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearMar    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearApr    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearMay    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearJun    : num  1 1 1 1 1 1 1 1 1 1 ...
     $ MonthofYearJul    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearAug    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearSep    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearOct    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearNov    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ cold              : num  0 0 0 0 0 0 0 0 0 0 ...
     $ hot               : num  0 0 0 0 0 0 0 0 0 0 ...
     $ overduetask       : num  0 0 0 0 0 0 0 0 0 0 ...

原始(简化)数据:

 df <-  people = c(1,1,1,2,2,3,3,4,4,5,5),
        activity = c(1,1,1,2,2,3,4,5,5,6,6),
        completion = c(0,0,1,0,1,1,1,0,1,0,1),
        sunshine = c(1,2,3,4,5,4,6,2,4,8,4)

到目前为止,我已将以下代码用于博客:

So far I've used this code for the cloglog:

model <- as.formula("completion ~  sunshine")
clog_full = glm(model,data=df,family = binomial(link = cloglog))
summary(clog_full)

使用软件包glmmML:

Using package glmmML:

model_re <- as.formula("completion ~  sunshine")
clog_re = glmmML(model_re,cluster = people, data= df,
    family = binomial(link = cloglog)) 
summary(clog_re)

使用lme4软件包:

model_re1<- as.formula("completion ~  (1|people) + sunshine") 
clog_re1 <- glmer(model_re1, data=df,
   family = binomial(link = cloglog)) 
summary(clog_re1)

但是,R不会从中得到任何结果,只是运行它们而不会得到结果.我是否必须将人员或活动作为集群使用?

However, R does not get any result out of this, just runs them but never comes to an result. Would i have to use people or activities as a cluster?

如果有人对如何使用固定截距运行该模型有一个想法,我很高兴知道.

If anyone, also has an idea on how to run this model with a fixed intercept, I am happy to know.

推荐答案

这对我来说很好(或多或少:请参见下面的注释)

This works fine for me (more or less: see notes below)

## added data.frame()
df <-  data.frame(people = c(1,1,1,2,2,3,3,4,4,5,5),
        activity = c(1,1,1,2,2,3,4,5,5,6,6),
        completion = c(0,0,1,0,1,1,1,0,1,0,1),
        sunshine = c(1,2,3,4,5,4,6,2,4,8,4)
        )

model_re1 <- completion ~  (1|people) + sunshine
clog_re1 <- glmer(model_re1, data=df,
                  family = binomial(link = cloglog))

  • 这很快完成(不到一秒钟):也许您忘了关闭引号或括号或其他内容??
  • 但是,它的确会产生一条消息边界(奇异)拟合:请参阅?isSingular",这是因为您的数据集太小/噪声太大,导致人与人之间的变异的最佳估计为零(因为它不能为负).
  • 更新:很遗憾地告诉您,混合模型(GLMM)的计算强度比标准GLM大得多:带有68个预测变量的500K观测绝对是一个大问题,您应该期望适合花几个小时.我有一些建议:

    update: I'm sorry to tell you that mixed models (GLMMs) are significantly more computationally intensive than standard GLMs: 500K observations with 68 predictor variables is definitely a large problem, and you should expect the fit to take hours. I have a few suggestions:

    • 您绝对应该尝试使用数据的子集(观察值和预测变量),以了解计算时间将如何扩展
    • 据我所知,
    • glmmTMB程序包是R中解决该问题最快的选项(lme4会因大量的预测变量而严重缩放),但可能会遇到内存限制. MixedModels.jl Julia程序包可能会更快.
    • 您通常可以打开详细"或跟踪"选项,以使您知道模型至少在解决问题,而不是完全陷入困境(知道完成需要多长时间实际上是不可行的,但至少您知道发生了什么事...)
    • 如果Stata更快(我对此表示怀疑,但有可能),则可以使用它
    • you should definitely try out subsets of your data (both observations and predictor variables) to get a sense of how the computation time will scale
    • the glmmTMB package is as far as I know the fastest option within R for this problem (lme4 will scale badly with large numbers of predictor variables), but might run into memory constraints. The MixedModels.jl Julia package might be faster.
    • you can usually turn on a "verbose" or "tracing" option to let you know that the model is at least working on the problem, rather than completely stuck (it's not really feasible to know how long it will take to complete, but at least you know something is happening ...)
    • If Stata is much faster (I doubt it, but it's possible) you could use it

    这是一个具有10,000个观测值和单个预测变量的示例.

    Here's an example with 10,000 observations and a single predictor variable.

    n <- 10000
    set.seed(101)
    dd <- data.frame(people=factor(rep(1:10000,each=3)),
                     sunshine=sample(1:8,size=n, replace=TRUE))
    dd$completion <- simulate(~(1|people)+sunshine,
                              newdata=dd,
                              family=binomial(link="cloglog"),
                              newparams=list(beta=c(0,1),
                                             theta=1))[[1]]
    

    glmer运行80秒,然后失败:

    glmer runs for 80 seconds and then fails:

    system.time(update(clog_re1, data=dd, verbose=100))
    

    另一方面,glmmTMB在大约20秒内解决了此问题(我的计算机上有8个内核,而glmmTMB使用了所有这些内核,因此为此作业分配的CPU占用率高达750%;如果您的核心更少,那么经过的计算时间就会相应地增加).

    On the other hand, glmmTMB does this problem in about 20 seconds (I have 8 cores on my computer, and glmmTMB uses all of them, so the CPU allocation to this job goes up to 750%; if you have fewer cores the elapsed computational time will increase accordingly).

    library(glmmTMB)
    system.time(glmmTMB(model_re1,data=dd,family = binomial(link = cloglog),
                        verbose=TRUE))
    

    如果我再添加四个预测变量(总共5个),则计算时间将达到46秒(再次使用8个核:所有核的总计算时间为320秒).具有13倍的预测变量和50倍的观测值,您绝对应该期望这是一个具有挑战性的计算.

    If I add four more predictor variables (for a total of 5), the computation time goes up to 46 seconds (again using 8 cores: the total computation time across all cores is 320 seconds). With 13 times as many predictor variables and 50 times as many observations, you should definitely expect this to be a challenging computation.

    非常异质性的非常粗略评估是为了拟合齐次模型并将残差(或皮尔逊残差平方和)与模型的残差自由度进行比较;如果前者大得多,则表明存在某种形式的不匹配(异质性或其他原因).

    A very crude assessment of heterogeneity would be to fit the homogeneous model and compare the residual deviance (or sum of squared Pearson residuals) to the residual degrees of freedom of the model; if the former is much larger, that's evidence of some form of mis-fit (heterogeneity or something else).

    这篇关于随机拦截GLM的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆