Logistic回归返回错误,但可以在简化的数据集中运行 [英] Logistic regression returns error but runs okay on reduced dataset

查看:129
本文介绍了Logistic回归返回错误,但可以在简化的数据集中运行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

非常感谢您在此方面的投入!

I would appreciate your input on this a lot!

我正在进行逻辑回归,但是由于某些原因,它不起作用:

I am working on a logistic regression, but it is not working for some reason:

mod1<-glm(survive~reLDM2+yr+yr2+reLDM2:yr +reLDM2:yr2+NestAge0,
         family=binomial(link=logexp(NSSH1$exposure)),
                       data=NSSH1, control = list(maxit = 50))

当我用更少的数据运行相同的模型时,它将起作用! 但是使用完整的数据集,我会收到一条错误和警告消息:

When I run the same model with less data it works! But with the complete dataset I get an error and warning messages:

Error: inner loop 1; cannot correct step size
In addition: Warning messages:
1: step size truncated due to divergence 
2: step size truncated due to divergence 

这是数据: https://www.dropbox. com/s/8ib8m1fh176556h/NSSH1.csv?dl = 0

用户-已知命运生存建模的glmer定义的链接函数:

library(MASS)
logexp <- function(exposure = 1) {
    linkfun <- function(mu) qlogis(mu^(1/exposure))
    ## FIXME: is there some trick we can play here to allow
    ##   evaluation in the context of the 'data' argument?
    linkinv <- function(eta)  plogis(eta)^exposure
    mu.eta <- function(eta) exposure * plogis(eta)^(exposure-1) *
      .Call(stats:::C_logit_mu_eta, eta, PACKAGE = "stats")
    valideta <- function(eta) TRUE
    link <- paste("logexp(", deparse(substitute(exposure)), ")",
               sep="")
    structure(list(linkfun = linkfun, linkinv = linkinv,
               mu.eta = mu.eta, valideta = valideta, 
               name = link),
          class = "link-glm")
}

推荐答案

tl; dr 遇到麻烦了,因为您的yryr2预测变量(大概是年平方和年平方) )与异常链接功能结合使用会引起数字故障;您可以使用 glm2软件包来解决这个问题,但是我在这种情况下,至少可以考虑一下是否适合平方年的期限.

tl;dr you're getting in trouble because your yr and yr2 predictors (presumably year and year-squared) are combining with an unusual link function to cause numerical trouble; you can get past this using the glm2 package, but I would give at least a bit of thought to whether it makes sense to try to fit the squared year term in this case.

更新:从下面开始使用mle2进行暴力破解;尚未编写它来进行交互的完整模型.

Update: brute-force approach with mle2 started below; haven't yet written it to do the full model with interactions.

安德鲁·盖尔曼(Andrew Gelman)的民间定理可能在这里适用:

Andrew Gelman's folk theorem probably applies here:

当您遇到计算问题时,您的模型通常会出现问题.

When you have computational problems, often there’s a problem with your model.

我首先尝试了模型的简化版本,没有交互...

I started by trying a simplified version of your model, without the interactions ...

NSSH1 <- read.csv("NSSH1.csv")
source("logexpfun.R")  ## for logexp link
mod1 <- glm(survive~reLDM2+yr+yr2+NestAge0,
          family=binomial(link=logexp(NSSH1$exposure)),
          data=NSSH1, control = list(maxit = 50))

...效果很好.现在,让我们尝试看看问题出在哪里:

... which works fine. Now let's try to see where the trouble is:

mod2 <- update(mod1,.~.+reLDM2:yr)  ## OK
mod3 <- update(mod1,.~.+reLDM2:yr2) ## OK
mod4 <- update(mod1,.~.+reLDM2:yr2+reLDM2:yr)  ## bad

好的,因此我们无法同时包含两个互动.这些预测变量实际上是如何相互关联的?让我们看看...

OK, so we're having trouble including both interactions at once. How are these predictors actually related to each other? Let's see ...

pairs(NSSH1[,c("reLDM2","yr","yr2")],gap=0)

yryr2并不是完全相关的,但是它们之间的排名是完全相关的,并且它们在数字上相互干扰也就不足为奇了 更新:当然,年"和年平方"看起来像这样!即使使用构造正交多项式的poly(yr,2),在这种情况下也无济于事...仍然值得看一下参数,以防它提供线索...

yr and yr2 aren't perfectly correlated, but they're perfectly rank-correlated and it's certainly not surprising that they're interfering with each other numerically Update: of course "year" and "year-squared" look like this! even using poly(yr,2), which constructs an orthogonal polynomial, doesn't help in this case ... still, it's worth looking at the parameters in case it provides a clue ...

如上所述,我们可以尝试glm2(使用更强大的算法替代glm),看看会发生什么...

As mentioned above, we can try glm2 (a drop-in replacement for glm with a more robust algorithm) and see what happens ...

library(glm2)
mod5 <- glm2(survive~reLDM2+yr+yr2+reLDM2:yr +reLDM2:yr2+NestAge0,
          family=binomial(link=logexp(NSSH1$exposure)),
          data=NSSH1, control = list(maxit = 50))

现在我们确实得到了答案.如果检查cov2cor(vcov(mod5)),我们会发现yryr2参数(以及与reLDM2交互的参数)具有极强的负相关性(大约为-0.97).让我们直观地看到...

Now we do get an answer. If we check cov2cor(vcov(mod5)), we see that the yr and yr2 parameters (and the parameters for their interaction with reLDM2 are strongly negatively correlated (about -0.97). Let's visualize that ...

library(corrplot)
corrplot(cov2cor(vcov(mod5)),method="ellipse")

如果我们尝试用蛮力做到这一点怎么办?

What if we try to do this by brute force?

library(bbmle)
link <- logexp(NSSH1$exposure)
fit <- mle2(survive~dbinom(prob=link$linkinv(eta),size=1),
     parameters=list(eta~reLDM2+yr+yr2+NestAge0),
     start=list(eta=0),
     data=NSSH1,
     method="Nelder-Mead")  ## more robust than default BFGS
summary(fit)
##                   Estimate Std. Error  z value   Pr(z)    
## eta.(Intercept)  4.3627816  0.0402640 108.3545 < 2e-16 ***
## eta.reLDM2      -0.0019682  0.0011738  -1.6767 0.09359 .  
## eta.yr          -6.0852108  0.2068159 -29.4233 < 2e-16 ***
## eta.yr2          5.7332780  0.1950289  29.3971 < 2e-16 ***
## eta.NestAge0     0.0612248  0.0051272  11.9411 < 2e-16 ***

这似乎是合理的(您应该检查预测值,并确保它们有意义...),但是...

This seems reasonable (you should check predicted values and see that they make sense ...), but ...

cc <- confint(fit)  ## "profiling has found a better solution"

这将返回一个mle2对象,但是该对象的调用槽位置已损坏,因此打印结果很麻烦.

This returns an mle2 object, but one with a mangled call slot, so it's ugly to print the results.

coef(cc)
## eta.(Intercept)                      eta.reLDM2 
##     4.329824508                    -0.011996582 
##       eta.yr                         eta.yr2 
##     0.101221970                     0.093377127 
##     eta.NestAge0 
##      0.003460453 
##
vcov(cc) ## all NA values! problem?

尝试从那些返回的值重新启动...

Try restarting from those returned values ...

fit2 <- update(fit,start=list(eta=unname(coef(cc))))
coef(summary(fit2))
##                     Estimate  Std. Error    z value        Pr(z)
## eta.(Intercept)  4.452345889 0.033864818 131.474082 0.000000e+00
## eta.reLDM2      -0.013246977 0.001076194 -12.309102 8.091828e-35
## eta.yr           0.103013607 0.094643420   1.088439 2.764013e-01
## eta.yr2          0.109709373 0.098109924   1.118229 2.634692e-01
## eta.NestAge0    -0.006428657 0.004519983  -1.422274 1.549466e-01

现在我们可以获得置信区间...

Now we can get confidence intervals ...

ci2 <- confint(fit2)
##                       2.5 %       97.5 %
## eta.(Intercept)  4.38644052  4.519116156
## eta.reLDM2      -0.01531437 -0.011092655
## eta.yr          -0.08477933  0.286279919
## eta.yr2         -0.08041548  0.304251382
## eta.NestAge0    -0.01522353  0.002496006

这似乎行得通,但我会非常对这些合适之处感到怀疑.您可能应该尝试使用其他优化器,以确保可以返回相同的结果.一些更好的优化工具,例如AD Model Builder或Template Model Builder,可能是个好主意.

This seems to work, but I would be very suspicious of these fits. You should probably try other optimizers to make sure you can get back to the same results. Some better optimization tool such as AD Model Builder or Template Model Builder might be a good idea.

我不认为无意中丢弃具有高度相关参数估计的预测变量,但这可能是考虑的合理时机.

I don't hold with mindlessly dropping predictors with strongly correlated parameter estimates, but this might be a reasonable time to consider it.

这篇关于Logistic回归返回错误,但可以在简化的数据集中运行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆