Logistic回归返回错误,但可以在简化的数据集中运行 [英] Logistic regression returns error but runs okay on reduced dataset
问题描述
非常感谢您在此方面的投入!
I would appreciate your input on this a lot!
我正在进行逻辑回归,但是由于某些原因,它不起作用:
I am working on a logistic regression, but it is not working for some reason:
mod1<-glm(survive~reLDM2+yr+yr2+reLDM2:yr +reLDM2:yr2+NestAge0,
family=binomial(link=logexp(NSSH1$exposure)),
data=NSSH1, control = list(maxit = 50))
当我用更少的数据运行相同的模型时,它将起作用! 但是使用完整的数据集,我会收到一条错误和警告消息:
When I run the same model with less data it works! But with the complete dataset I get an error and warning messages:
Error: inner loop 1; cannot correct step size
In addition: Warning messages:
1: step size truncated due to divergence
2: step size truncated due to divergence
这是数据: https://www.dropbox. com/s/8ib8m1fh176556h/NSSH1.csv?dl = 0
library(MASS)
logexp <- function(exposure = 1) {
linkfun <- function(mu) qlogis(mu^(1/exposure))
## FIXME: is there some trick we can play here to allow
## evaluation in the context of the 'data' argument?
linkinv <- function(eta) plogis(eta)^exposure
mu.eta <- function(eta) exposure * plogis(eta)^(exposure-1) *
.Call(stats:::C_logit_mu_eta, eta, PACKAGE = "stats")
valideta <- function(eta) TRUE
link <- paste("logexp(", deparse(substitute(exposure)), ")",
sep="")
structure(list(linkfun = linkfun, linkinv = linkinv,
mu.eta = mu.eta, valideta = valideta,
name = link),
class = "link-glm")
}
推荐答案
tl; dr 遇到麻烦了,因为您的yr
和yr2
预测变量(大概是年平方和年平方) )与异常链接功能结合使用会引起数字故障;您可以使用 glm2软件包来解决这个问题,但是我在这种情况下,至少可以考虑一下是否适合平方年的期限.
tl;dr you're getting in trouble because your yr
and yr2
predictors (presumably year and year-squared) are combining with an unusual link function to cause numerical trouble; you can get past this using the glm2 package, but I would give at least a bit of thought to whether it makes sense to try to fit the squared year term in this case.
更新:从下面开始使用mle2
进行暴力破解;尚未编写它来进行交互的完整模型.
Update: brute-force approach with mle2
started below; haven't yet written it to do the full model with interactions.
安德鲁·盖尔曼(Andrew Gelman)的民间定理可能在这里适用:
Andrew Gelman's folk theorem probably applies here:
当您遇到计算问题时,您的模型通常会出现问题.
When you have computational problems, often there’s a problem with your model.
我首先尝试了模型的简化版本,没有交互...
I started by trying a simplified version of your model, without the interactions ...
NSSH1 <- read.csv("NSSH1.csv")
source("logexpfun.R") ## for logexp link
mod1 <- glm(survive~reLDM2+yr+yr2+NestAge0,
family=binomial(link=logexp(NSSH1$exposure)),
data=NSSH1, control = list(maxit = 50))
...效果很好.现在,让我们尝试看看问题出在哪里:
... which works fine. Now let's try to see where the trouble is:
mod2 <- update(mod1,.~.+reLDM2:yr) ## OK
mod3 <- update(mod1,.~.+reLDM2:yr2) ## OK
mod4 <- update(mod1,.~.+reLDM2:yr2+reLDM2:yr) ## bad
好的,因此我们无法同时包含两个互动.这些预测变量实际上是如何相互关联的?让我们看看...
OK, so we're having trouble including both interactions at once. How are these predictors actually related to each other? Let's see ...
pairs(NSSH1[,c("reLDM2","yr","yr2")],gap=0)
更新:当然,年"和年平方"看起来像这样!即使使用构造正交多项式的yr
和yr2
并不是完全相关的,但是它们之间的排名是完全相关的,并且它们在数字上相互干扰也就不足为奇了poly(yr,2)
,在这种情况下也无济于事...仍然值得看一下参数,以防它提供线索...
Update: of course "year" and "year-squared" look like this! even using yr
and yr2
aren't perfectly correlated, but they're perfectly rank-correlated and it's certainly not surprising that they're interfering with each other numericallypoly(yr,2)
, which constructs an orthogonal polynomial, doesn't help in this case ... still, it's worth looking at the parameters in case it provides a clue ...
如上所述,我们可以尝试glm2
(使用更强大的算法替代glm
),看看会发生什么...
As mentioned above, we can try glm2
(a drop-in replacement for glm
with a more robust algorithm) and see what happens ...
library(glm2)
mod5 <- glm2(survive~reLDM2+yr+yr2+reLDM2:yr +reLDM2:yr2+NestAge0,
family=binomial(link=logexp(NSSH1$exposure)),
data=NSSH1, control = list(maxit = 50))
现在我们确实得到了答案.如果检查cov2cor(vcov(mod5))
,我们会发现yr
和yr2
参数(以及与reLDM2
交互的参数)具有极强的负相关性(大约为-0.97).让我们直观地看到...
Now we do get an answer. If we check cov2cor(vcov(mod5))
, we see that the yr
and yr2
parameters (and the parameters for their interaction with reLDM2
are strongly negatively correlated (about -0.97). Let's visualize that ...
library(corrplot)
corrplot(cov2cor(vcov(mod5)),method="ellipse")
如果我们尝试用蛮力做到这一点怎么办?
What if we try to do this by brute force?
library(bbmle)
link <- logexp(NSSH1$exposure)
fit <- mle2(survive~dbinom(prob=link$linkinv(eta),size=1),
parameters=list(eta~reLDM2+yr+yr2+NestAge0),
start=list(eta=0),
data=NSSH1,
method="Nelder-Mead") ## more robust than default BFGS
summary(fit)
## Estimate Std. Error z value Pr(z)
## eta.(Intercept) 4.3627816 0.0402640 108.3545 < 2e-16 ***
## eta.reLDM2 -0.0019682 0.0011738 -1.6767 0.09359 .
## eta.yr -6.0852108 0.2068159 -29.4233 < 2e-16 ***
## eta.yr2 5.7332780 0.1950289 29.3971 < 2e-16 ***
## eta.NestAge0 0.0612248 0.0051272 11.9411 < 2e-16 ***
这似乎是合理的(您应该检查预测值,并确保它们有意义...),但是...
This seems reasonable (you should check predicted values and see that they make sense ...), but ...
cc <- confint(fit) ## "profiling has found a better solution"
这将返回一个mle2
对象,但是该对象的调用槽位置已损坏,因此打印结果很麻烦.
This returns an mle2
object, but one with a mangled call slot, so it's ugly to print the results.
coef(cc)
## eta.(Intercept) eta.reLDM2
## 4.329824508 -0.011996582
## eta.yr eta.yr2
## 0.101221970 0.093377127
## eta.NestAge0
## 0.003460453
##
vcov(cc) ## all NA values! problem?
尝试从那些返回的值重新启动...
Try restarting from those returned values ...
fit2 <- update(fit,start=list(eta=unname(coef(cc))))
coef(summary(fit2))
## Estimate Std. Error z value Pr(z)
## eta.(Intercept) 4.452345889 0.033864818 131.474082 0.000000e+00
## eta.reLDM2 -0.013246977 0.001076194 -12.309102 8.091828e-35
## eta.yr 0.103013607 0.094643420 1.088439 2.764013e-01
## eta.yr2 0.109709373 0.098109924 1.118229 2.634692e-01
## eta.NestAge0 -0.006428657 0.004519983 -1.422274 1.549466e-01
现在我们可以获得置信区间...
Now we can get confidence intervals ...
ci2 <- confint(fit2)
## 2.5 % 97.5 %
## eta.(Intercept) 4.38644052 4.519116156
## eta.reLDM2 -0.01531437 -0.011092655
## eta.yr -0.08477933 0.286279919
## eta.yr2 -0.08041548 0.304251382
## eta.NestAge0 -0.01522353 0.002496006
这似乎行得通,但我会非常对这些合适之处感到怀疑.您可能应该尝试使用其他优化器,以确保可以返回相同的结果.一些更好的优化工具,例如AD Model Builder或Template Model Builder,可能是个好主意.
This seems to work, but I would be very suspicious of these fits. You should probably try other optimizers to make sure you can get back to the same results. Some better optimization tool such as AD Model Builder or Template Model Builder might be a good idea.
我不认为无意中丢弃具有高度相关参数估计的预测变量,但这可能是考虑的合理时机.
I don't hold with mindlessly dropping predictors with strongly correlated parameter estimates, but this might be a reasonable time to consider it.
这篇关于Logistic回归返回错误,但可以在简化的数据集中运行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!