Logistic regression returns error but runs okay on reduced dataset
Problem description
I would appreciate your input on this a lot!
I am working on a logistic regression, but it is not working for some reason:
mod1<-glm(survive~reLDM2+yr+yr2+reLDM2:yr +reLDM2:yr2+NestAge0,
family=binomial(link=logexp(NSSH1$exposure)),
data=NSSH1, control = list(maxit = 50))
When I run the same model with less data it works! But with the complete dataset I get an error and warning messages:
Error: inner loop 1; cannot correct step size
In addition: Warning messages:
1: step size truncated due to divergence
2: step size truncated due to divergence
This is the data: https://www.dropbox.com/s/8ib8m1fh176556h/NSSH1.csv?dl=0
The log-exposure link function is taken from "User-defined link function for glmer for known-fate survival modeling":
library(MASS)
logexp <- function(exposure = 1) {
linkfun <- function(mu) qlogis(mu^(1/exposure))
## FIXME: is there some trick we can play here to allow
## evaluation in the context of the 'data' argument?
linkinv <- function(eta) plogis(eta)^exposure
mu.eta <- function(eta) exposure * plogis(eta)^(exposure-1) *
.Call(stats:::C_logit_mu_eta, eta, PACKAGE = "stats")
valideta <- function(eta) TRUE
link <- paste("logexp(", deparse(substitute(exposure)), ")",
sep="")
structure(list(linkfun = linkfun, linkinv = linkinv,
mu.eta = mu.eta, valideta = valideta,
name = link),
class = "link-glm")
}
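For reference, the algebra inside this link function is simple: the inverse link is plogis(eta)^exposure and the link is the logit of mu^(1/exposure). A minimal Python sketch (illustration only, not part of the R code above; the toy mu and exposure values are arbitrary) checking that the two are mutual inverses:

```python
import math

def plogis(x):
    # standard logistic CDF, same as R's plogis()
    return 1.0 / (1.0 + math.exp(-x))

def qlogis(p):
    # logistic quantile function (logit), same as R's qlogis()
    return math.log(p / (1.0 - p))

def logexp_linkfun(mu, exposure):
    # eta = logit(mu^(1/exposure))
    return qlogis(mu ** (1.0 / exposure))

def logexp_linkinv(eta, exposure):
    # mu = plogis(eta)^exposure
    return plogis(eta) ** exposure

# round-trip check with an arbitrary exposure of 14 days
mu = 0.9
eta = logexp_linkfun(mu, exposure=14)
assert abs(logexp_linkinv(eta, exposure=14) - mu) < 1e-12
```

The exposure term raises the per-interval survival probability to the power of the number of days at risk, which is what makes this a "known-fate" link rather than a plain logit.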
tl;dr you're getting in trouble because your yr and yr2 predictors (presumably year and year-squared) are combining with an unusual link function to cause numerical trouble; you can get past this using the glm2 package, but I would give at least a bit of thought to whether it makes sense to try to fit the squared year term in this case.
Update: brute-force approach with mle2 started below; haven't yet written it to do the full model with interactions.
Andrew Gelman's folk theorem probably applies here:
When you have computational problems, often there’s a problem with your model.
I started by trying a simplified version of your model, without the interactions ...
NSSH1 <- read.csv("NSSH1.csv")
source("logexpfun.R") ## for logexp link
mod1 <- glm(survive~reLDM2+yr+yr2+NestAge0,
family=binomial(link=logexp(NSSH1$exposure)),
data=NSSH1, control = list(maxit = 50))
... which works fine. Now let's try to see where the trouble is:
mod2 <- update(mod1,.~.+reLDM2:yr) ## OK
mod3 <- update(mod1,.~.+reLDM2:yr2) ## OK
mod4 <- update(mod1,.~.+reLDM2:yr2+reLDM2:yr) ## bad
OK, so we're having trouble including both interactions at once. How are these predictors actually related to each other? Let's see ...
pairs(NSSH1[,c("reLDM2","yr","yr2")],gap=0)
Update: of course "year" and "year-squared" look like this! Even using poly(yr,2), which constructs an orthogonal polynomial, doesn't help in this case ... yr and yr2 aren't perfectly correlated, but they're perfectly rank-correlated, and it's certainly not surprising that they're interfering with each other numerically. Still, it's worth looking at the parameters in case it provides a clue ...
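The correlated-but-not-perfectly-correlated distinction is easy to see numerically. A small Python sketch (illustration only, with hypothetical year values 1–20 standing in for the real covariate) showing that a variable and its square have Pearson correlation below 1 but Spearman (rank) correlation exactly 1:

```python
import math

def pearson(x, y):
    # plain Pearson product-moment correlation
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    # rank-transform, then Pearson on the ranks (no ties here)
    rank = lambda v: [sorted(v).index(a) + 1 for a in v]
    return pearson(rank(x), rank(y))

yr = list(range(1, 21))       # hypothetical year covariate
yr2 = [y * y for y in yr]     # year squared

assert pearson(yr, yr2) < 1.0               # not perfectly linearly correlated
assert abs(spearman(yr, yr2) - 1.0) < 1e-12  # but perfectly rank-correlated
```

Because squaring is monotone on positive years, the ranks coincide exactly, which is why the two predictors carry nearly redundant information even though the design matrix is not literally singular.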
As mentioned above, we can try glm2
(a drop-in replacement for glm
with a more robust algorithm) and see what happens ...
library(glm2)
mod5 <- glm2(survive~reLDM2+yr+yr2+reLDM2:yr +reLDM2:yr2+NestAge0,
family=binomial(link=logexp(NSSH1$exposure)),
data=NSSH1, control = list(maxit = 50))
Now we do get an answer. If we check cov2cor(vcov(mod5)), we see that the yr and yr2 parameters (and the parameters for their interactions with reLDM2) are strongly negatively correlated (about -0.97). Let's visualize that ...
library(corrplot)
corrplot(cov2cor(vcov(mod5)),method="ellipse")
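cov2cor itself is just elementwise arithmetic: each covariance is divided by the product of the two standard deviations. A minimal Python equivalent (illustration only; the 2x2 toy matrix below is made up to mimic the strong negative yr/yr2 term, not taken from vcov(mod5)):

```python
import math

def cov2cor(cov):
    # correlation[i][j] = cov[i][j] / sqrt(cov[i][i] * cov[j][j])
    sd = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]

# toy covariance matrix with a strong negative off-diagonal term,
# qualitatively like the yr/yr2 block of vcov(mod5)
V = [[4.0, -3.88],
     [-3.88, 4.0]]
R = cov2cor(V)
```

The diagonal of the result is always 1, so the interesting information is entirely in the off-diagonal entries that corrplot draws.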
What if we try to do this by brute force?
library(bbmle)
link <- logexp(NSSH1$exposure)
fit <- mle2(survive~dbinom(prob=link$linkinv(eta),size=1),
parameters=list(eta~reLDM2+yr+yr2+NestAge0),
start=list(eta=0),
data=NSSH1,
method="Nelder-Mead") ## more robust than default BFGS
summary(fit)
## Estimate Std. Error z value Pr(z)
## eta.(Intercept) 4.3627816 0.0402640 108.3545 < 2e-16 ***
## eta.reLDM2 -0.0019682 0.0011738 -1.6767 0.09359 .
## eta.yr -6.0852108 0.2068159 -29.4233 < 2e-16 ***
## eta.yr2 5.7332780 0.1950289 29.3971 < 2e-16 ***
## eta.NestAge0 0.0612248 0.0051272 11.9411 < 2e-16 ***
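What mle2 is doing under the hood is minimizing the Bernoulli negative log-likelihood with prob = linkinv(eta). A Python sketch of that objective for an intercept-only model (illustration only: the survive/exposure vectors are small made-up known-fate data, not NSSH1, and the grid search stands in for Nelder-Mead):

```python
import math

def plogis(x):
    return 1.0 / (1.0 + math.exp(-x))

def nll(eta, survive, exposure):
    # negative log-likelihood of Bernoulli outcomes with
    # prob = plogis(eta)^exposure (the logexp inverse link)
    total = 0.0
    for s, ex in zip(survive, exposure):
        p = plogis(eta) ** ex
        total -= math.log(p if s == 1 else 1.0 - p)
    return total

# hypothetical known-fate data: outcome and days of exposure
survive  = [1, 1, 1, 0, 1, 1, 0, 1]
exposure = [14, 10, 12, 7, 14, 9, 11, 13]

# crude grid search over a single intercept-only eta
grid = [i / 100.0 for i in range(0, 801)]
best = min(grid, key=lambda e: nll(e, survive, exposure))
```

The minimum lands well inside the grid; mle2 performs the same minimization, but over the full linear predictor eta ~ reLDM2 + yr + yr2 + NestAge0 with a proper optimizer.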
This seems reasonable (you should check predicted values and see that they make sense ...), but ...
cc <- confint(fit) ## "profiling has found a better solution"
This returns an mle2 object, but one with a mangled call slot, so it's ugly to print the results.
coef(cc)
## eta.(Intercept) eta.reLDM2
## 4.329824508 -0.011996582
## eta.yr eta.yr2
## 0.101221970 0.093377127
## eta.NestAge0
## 0.003460453
##
vcov(cc) ## all NA values! problem?
Try restarting from those returned values ...
fit2 <- update(fit,start=list(eta=unname(coef(cc))))
coef(summary(fit2))
## Estimate Std. Error z value Pr(z)
## eta.(Intercept) 4.452345889 0.033864818 131.474082 0.000000e+00
## eta.reLDM2 -0.013246977 0.001076194 -12.309102 8.091828e-35
## eta.yr 0.103013607 0.094643420 1.088439 2.764013e-01
## eta.yr2 0.109709373 0.098109924 1.118229 2.634692e-01
## eta.NestAge0 -0.006428657 0.004519983 -1.422274 1.549466e-01
Now we can get confidence intervals ...
ci2 <- confint(fit2)
## 2.5 % 97.5 %
## eta.(Intercept) 4.38644052 4.519116156
## eta.reLDM2 -0.01531437 -0.011092655
## eta.yr -0.08477933 0.286279919
## eta.yr2 -0.08041548 0.304251382
## eta.NestAge0 -0.01522353 0.002496006
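One cheap sanity check on these profile intervals is the Wald approximation, estimate ± 1.96 × std. error, computed from the coef(summary(fit2)) table above. A Python sketch for the eta.reLDM2 row (numbers copied from the output above; the agreement tolerance is my own choice):

```python
# Wald 95% interval for eta.reLDM2, from coef(summary(fit2))
est, se = -0.013246977, 0.001076194
lo, hi = est - 1.96 * se, est + 1.96 * se

# profile interval reported by confint(fit2) for the same parameter
prof_lo, prof_hi = -0.01531437, -0.011092655

# when the log-likelihood is close to quadratic near the optimum,
# the two intervals should agree closely (here to a few 1e-5)
wald_vs_profile = (abs(lo - prof_lo), abs(hi - prof_hi))
```

Close agreement between Wald and profile intervals is mildly reassuring, though it doesn't address the deeper worry about the fit itself.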
This seems to work, but I would be very suspicious of these fits. You should probably try other optimizers to make sure you can get back to the same results. Some better optimization tool such as AD Model Builder or Template Model Builder might be a good idea.
I don't hold with mindlessly dropping predictors with strongly correlated parameter estimates, but this might be a reasonable time to consider it.