Why does glm.nb throw a "missing value" error only on very specific inputs


Question


glm.nb throws an unusual error on certain inputs. While there are a variety of values that cause this error, changing the input even very slightly can prevent the error.

A reproducible example:

library(MASS)  # glm.nb is in MASS

set.seed(11)
pop <- rnbinom(n = 1000, size = 1, mu = 0.05)
glm.nb(pop ~ 1, maxit = 1000)


Running this code throws the error:

Error in while ((it <- it + 1) < limit && abs(del) > eps) { : 
  missing value where TRUE/FALSE needed


At first I assumed that this had something to do with the algorithm not converging. However, I was surprised to find that changing the input even very slightly can prevent the error. For example:

pop[1000] <- pop[1000] + 1
glm.nb(pop~1,maxit=1000)


I've found that it throws this error on 19.4% of the seeds between 1 and 500:

fit.with.seed <- function(s) {
    set.seed(s)
    pop <- rnbinom(n = 1000, size = 1, mu = 0.05)
    glm.nb(pop ~ 1, maxit = 1000)
}

errors <- sapply(1:500, function(s) {
    is.null(tryCatch(fit.with.seed(s), error = function(e) NULL))
})

mean(errors)


I've found only one mention of this error anywhere, on a thread with no responses.

What could be causing this error, and how can it be fixed (other than randomly permuting the inputs every time glm.nb throws an error)?


ETA: Setting control=glm.control(maxit=200,trace = 3) finds that the theta.ml algorithm breaks by getting very large, then becoming -Inf, then becoming NaN:

theta.ml: iter67 theta =5.77203e+15
theta.ml: iter68 theta =5.28327e+15
theta.ml: iter69 theta =1.41103e+16
theta.ml: iter70 theta =-Inf
theta.ml: iter71 theta =NaN
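The failing `while` loop lives in `MASS::theta.ml`. A minimal sketch, assuming an intercept-only model (so every fitted mean equals `mean(pop)`), reproduces the runaway iterations without going through `glm.nb` at all:

```r
# Sketch: call MASS::theta.ml directly on the same data to watch theta diverge.
# For an intercept-only model the vector of fitted means is just mean(pop).
library(MASS)

set.seed(11)
pop <- rnbinom(n = 1000, size = 1, mu = 0.05)

# trace = TRUE prints each iteration; theta grows without bound and then
# overflows, which is what makes the while() condition evaluate to NA.
try(theta.ml(pop, mu = rep(mean(pop), length(pop)),
             limit = 100, trace = TRUE))
```

Wrapping the call in `try()` lets the script continue past the error so the printed trace can be inspected.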

Answer

It's a bit crude, but in the past I have been able to work around problems with glm.nb by resorting to straight maximum likelihood estimation (i.e. none of the clever iterative estimation algorithms used in glm.nb).


Some poking around/profiling indicates that the MLE for the theta parameter is effectively infinite. I decided to fit it on the inverse scale, so that I could put a boundary at 0 (a fancier version would set up a log-likelihood function that would revert to Poisson at theta=zero, but that would undo the point of trying to come up with a quick, canned solution).


With two of the bad examples given above, this works reasonably well, although it does warn that the parameter fit is on the boundary ...

library(bbmle)
# d1 is one of the problematic data sets referred to above
# (with predictors X1, X2 and an offset X3)
m1 <- mle2(Y ~ dnbinom(mu = exp(logmu), size = 1/invk),
           data = d1,
           parameters = list(logmu ~ X1 + X2 + offset(X3)),
           start = list(logmu = 0, invk = 1),
           method = "L-BFGS-B",
           lower = c(rep(-Inf, 12), 1e-8))


The second example is actually more interesting because it demonstrates numerically that the MLE for theta is essentially infinite even though we have a good-sized data set that is exactly generated from negative binomial deviates (or else I'm confused about something ...)

set.seed(11)
pop <- rnbinom(n = 1000, size = 1, mu = 0.05)
glm.nb(pop ~ 1, maxit = 1000)   # fails as described above

m2 <- mle2(pop ~ dnbinom(mu = exp(logmu), size = 1/invk),
           data = data.frame(pop),
           start = list(logmu = 0, invk = 1),
           method = "L-BFGS-B",
           lower = c(-Inf, 1e-8))
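Assuming the `m2` fit above succeeds, the boundary behaviour can be checked by inspecting the estimates: an `invk` pinned near the `1e-8` lower bound corresponds to `theta = 1/invk` of around `1e8`, i.e. effectively infinite (a Poisson-like fit).

```r
# Hypothetical check on the bbmle fit from above: if invk sits at its
# 1e-8 lower bound, the implied theta = 1/invk is effectively infinite.
coef(m2)                 # named vector with components logmu and invk
1 / coef(m2)[["invk"]]   # implied theta
```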

