逻辑回归返回错误,但在减少的数据集上运行正常 [英] Logistic regression returns error but runs okay on reduced dataset

查看:29
本文介绍了逻辑回归返回错误,但在减少的数据集上运行正常的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

非常感谢您对此的意见!

我正在研究逻辑回归,但由于某种原因它不起作用:

mod1<-glm(survive~reLDM2+yr+yr2+reLDM2:yr +reLDM2:yr2+NestAge0,家庭=二项式(链接=logexp(NSSH1$exposure)),数据=NSSH1,控制=列表(最大= 50))

当我用更少的数据运行相同的模型时,它就起作用了!但是对于完整的数据集,我收到一条错误和警告消息:

错误:内循环1;无法修正步长此外: 警告消息:1:由于发散而截断步长2:由于发散而截断步长

这是数据:

yryr2 不是 完全 相关的,但它们是完全相关的,这当然不足为奇他们在数字上相互干扰 更新:当然年"和年平方"看起来像这样!即使使用构造正交多项式的 poly(yr,2),在这种情况下也无济于事......不过,值得查看参数,以防它提供线索......

如上所述,我们可以尝试 glm2(使用更强大的算法替代 glm),看看会发生什么......

库(glm2)mod5 <- glm2(survive~reLDM2+yr+yr2+reLDM2:yr +reLDM2:yr2+NestAge0,家庭=二项式(链接=logexp(NSSH1$exposure)),数据=NSSH1,控制=列表(最大= 50))

现在我们确实得到了答案.如果我们检查 cov2cor(vcov(mod5)),我们会看到 yryr2 参数(以及它们与 reLDM2 呈强负相关(约 -0.97).让我们想象一下......

库(corrplot)corrplot(cov2cor(vcov(mod5)),方法=椭圆")

如果我们试图用蛮力来做到这一点怎么办?

库(bbmle)链接 <- logexp(NSSH1$exposure)适合 <- mle2(survive~dbinom(prob=link$linkinv(eta),size=1),参数=列表(eta~reLDM2+yr+yr2+NestAge0),开始=列表(eta=0),数据=NSSH1,method="Nelder-Mead") ## 比默认的 BFGS 更健壮总结(适合)## 估计标准.误差 z 值 Pr(z)## eta.(拦截) 4.3627816 0.0402640 108.3545 <2e-16 ***## eta.reLDM2 -0.0019682 0.0011738 -1.6767 0.09359 .## eta.yr -6.0852108 0.2068159 -29.4233 <2e-16 ***## eta.yr2 5.7332780 0.1950289 29.3971 <2e-16 ***## eta.NestAge0 0.0612248 0.0051272 11.9411 <2e-16 ***

这似乎是合理的(您应该检查预测值并查看它们是否有意义...),但是...

cc <- confint(fit) ##分析找到了更好的解决方案"

这将返回一个 mle2 对象,但该对象具有错误的调用槽,因此打印结果很难看.

coef(cc)## eta.(拦截)eta.reLDM2## 4.329824508 -0.011996582## eta.yr eta.yr2## 0.101221970 0.093377127## eta.NestAge0## 0.003460453##vcov(cc) ## 所有 NA 值!问题?

尝试从这些返回值重新启动...

fit2 <- update(fit,start=list(eta=unname(coef(cc))))系数(摘要(fit2))## 估计标准.误差 z 值 Pr(z)## eta.(拦截) 4.452345889 0.033864818 131.474082 0.000000e+00## eta.reLDM2 -0.013246977 0.001076194 -12.309102 8.091828e-35## eta.yr 0.103013607 0.094643420 1.088439 2.764013e-01## eta.yr2 0.109709373 0.098109924 1.118229 2.634692e-01## eta.NestAge0 -0.006428657 0.004519983 -1.422274 1.549466e-01

现在我们可以得到置信区间......

ci2 <- confint(fit2)## 2.5% 97.5%## eta.(拦截) 4.38644052 4.519116156## eta.reLDM2 -0.01531437 -0.011092655## eta.yr -0.08477933 0.286279919## eta.yr2 -0.08041548 0.304251382## eta.NestAge0 -0.01522353 0.002496006

这似乎有效,但我会非常怀疑这些配合.您可能应该尝试其他优化器以确保可以返回相同的结果.一些更好的优化工具,例如 AD Model Builder 或 Template Model Builder 可能是个好主意.

我不同意无意识地丢弃具有强相关参数估计的预测变量,但这可能是考虑它的合理时机.

I would appreciate your input on this a lot!

I am working on a logistic regression, but it is not working for some reason:

mod1<-glm(survive~reLDM2+yr+yr2+reLDM2:yr +reLDM2:yr2+NestAge0,
         family=binomial(link=logexp(NSSH1$exposure)),
                       data=NSSH1, control = list(maxit = 50))

When I run the same model with less data it works! But with the complete dataset I get an error and warning messages:

Error: inner loop 1; cannot correct step size
In addition: Warning messages:
1: step size truncated due to divergence 
2: step size truncated due to divergence 

This is the data: https://www.dropbox.com/s/8ib8m1fh176556h/NSSH1.csv?dl=0

Log exposure link function from User-defined link function for glmer for known-fate survival modeling:

library(MASS)
logexp <- function(exposure = 1) {
    linkfun <- function(mu) qlogis(mu^(1/exposure))
    ## FIXME: is there some trick we can play here to allow
    ##   evaluation in the context of the 'data' argument?
    linkinv <- function(eta)  plogis(eta)^exposure
    mu.eta <- function(eta) exposure * plogis(eta)^(exposure-1) *
      .Call(stats:::C_logit_mu_eta, eta, PACKAGE = "stats")
    valideta <- function(eta) TRUE
    link <- paste("logexp(", deparse(substitute(exposure)), ")",
               sep="")
    structure(list(linkfun = linkfun, linkinv = linkinv,
               mu.eta = mu.eta, valideta = valideta, 
               name = link),
          class = "link-glm")
}

解决方案

tl;dr you're getting in trouble because your yr and yr2 predictors (presumably year and year-squared) are combining with an unusual link function to cause numerical trouble; you can get past this using the glm2 package, but I would give at least a bit of thought to whether it makes sense to try to fit the squared year term in this case.

Update: brute-force approach with mle2 started below; haven't yet written it to do the full model with interactions.

Andrew Gelman's folk theorem probably applies here:

When you have computational problems, often there’s a problem with your model.

I started by trying a simplified version of your model, without the interactions ...

NSSH1 <- read.csv("NSSH1.csv")
source("logexpfun.R")  ## for logexp link
mod1 <- glm(survive~reLDM2+yr+yr2+NestAge0,
          family=binomial(link=logexp(NSSH1$exposure)),
          data=NSSH1, control = list(maxit = 50))

... which works fine. Now let's try to see where the trouble is:

mod2 <- update(mod1,.~.+reLDM2:yr)  ## OK
mod3 <- update(mod1,.~.+reLDM2:yr2) ## OK
mod4 <- update(mod1,.~.+reLDM2:yr2+reLDM2:yr)  ## bad

OK, so we're having trouble including both interactions at once. How are these predictors actually related to each other? Let's see ...

pairs(NSSH1[,c("reLDM2","yr","yr2")],gap=0)

yr and yr2 aren't perfectly correlated, but they're perfectly rank-correlated and it's certainly not surprising that they're interfering with each other numerically Update: of course "year" and "year-squared" look like this! even using poly(yr,2), which constructs an orthogonal polynomial, doesn't help in this case ... still, it's worth looking at the parameters in case it provides a clue ...

As mentioned above, we can try glm2 (a drop-in replacement for glm with a more robust algorithm) and see what happens ...

library(glm2)
mod5 <- glm2(survive~reLDM2+yr+yr2+reLDM2:yr +reLDM2:yr2+NestAge0,
          family=binomial(link=logexp(NSSH1$exposure)),
          data=NSSH1, control = list(maxit = 50))

Now we do get an answer. If we check cov2cor(vcov(mod5)), we see that the yr and yr2 parameters (and the parameters for their interaction with reLDM2 are strongly negatively correlated (about -0.97). Let's visualize that ...

library(corrplot)
corrplot(cov2cor(vcov(mod5)),method="ellipse")

What if we try to do this by brute force?

library(bbmle)
link <- logexp(NSSH1$exposure)
fit <- mle2(survive~dbinom(prob=link$linkinv(eta),size=1),
     parameters=list(eta~reLDM2+yr+yr2+NestAge0),
     start=list(eta=0),
     data=NSSH1,
     method="Nelder-Mead")  ## more robust than default BFGS
summary(fit)
##                   Estimate Std. Error  z value   Pr(z)    
## eta.(Intercept)  4.3627816  0.0402640 108.3545 < 2e-16 ***
## eta.reLDM2      -0.0019682  0.0011738  -1.6767 0.09359 .  
## eta.yr          -6.0852108  0.2068159 -29.4233 < 2e-16 ***
## eta.yr2          5.7332780  0.1950289  29.3971 < 2e-16 ***
## eta.NestAge0     0.0612248  0.0051272  11.9411 < 2e-16 ***

This seems reasonable (you should check predicted values and see that they make sense ...), but ...

cc <- confint(fit)  ## "profiling has found a better solution"

This returns an mle2 object, but one with a mangled call slot, so it's ugly to print the results.

coef(cc)
## eta.(Intercept)                      eta.reLDM2 
##     4.329824508                    -0.011996582 
##       eta.yr                         eta.yr2 
##     0.101221970                     0.093377127 
##     eta.NestAge0 
##      0.003460453 
##
vcov(cc) ## all NA values! problem?

Try restarting from those returned values ...

fit2 <- update(fit,start=list(eta=unname(coef(cc))))
coef(summary(fit2))
##                     Estimate  Std. Error    z value        Pr(z)
## eta.(Intercept)  4.452345889 0.033864818 131.474082 0.000000e+00
## eta.reLDM2      -0.013246977 0.001076194 -12.309102 8.091828e-35
## eta.yr           0.103013607 0.094643420   1.088439 2.764013e-01
## eta.yr2          0.109709373 0.098109924   1.118229 2.634692e-01
## eta.NestAge0    -0.006428657 0.004519983  -1.422274 1.549466e-01

Now we can get confidence intervals ...

ci2 <- confint(fit2)
##                       2.5 %       97.5 %
## eta.(Intercept)  4.38644052  4.519116156
## eta.reLDM2      -0.01531437 -0.011092655
## eta.yr          -0.08477933  0.286279919
## eta.yr2         -0.08041548  0.304251382
## eta.NestAge0    -0.01522353  0.002496006

This seems to work, but I would be very suspicious of these fits. You should probably try other optimizers to make sure you can get back to the same results. Some better optimization tool such as AD Model Builder or Template Model Builder might be a good idea.

I don't hold with mindlessly dropping predictors with strongly correlated parameter estimates, but this might be a reasonable time to consider it.

这篇关于逻辑回归返回错误,但在减少的数据集上运行正常的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆