Calculate and plot 95% confidence intervals of a generalised nonlinear model


Question


I have built several generalised nonlinear least squares models (exponential decay) with the R package nlme and its gnls() function. The reason I do not simply build nonlinear least squares models with the base nls() function is that I want to be able to model heteroskedasticity and so avoid transformation. My models look something like this:

model <- gnls(Response ~ C * exp(k * Explanatory1) + A,
              start = list(C = c(C1,C1), k = c(k1,k1), A = c(A1,A1)),
              params = list(C ~ Explanatory2, k ~ Explanatory2, 
                            A ~ Explanatory2),
              weights = varPower(), 
              data = Data)

The key difference to a simple nls() model is the weights argument, which enables the modelling of heteroskedasticity by the explanatory variable(s). The linear equivalent to gnls() is generalised least squares, which is run with the gls() function of nlme.
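For reference, a brief sketch (not from my actual models) of what varPower() assumes: the residual variance is modelled as Var(e_i) = sigma^2 * |v_i|^(2*delta), where the variance covariate v_i defaults to the fitted value, delta is estimated from the data, and a one-sided formula ties v_i to an explanatory variable instead:

library(nlme)
vf1 <- varPower()                       # variance covariate v_i = fitted value (default)
vf2 <- varPower(form = ~ Explanatory1)  # variance covariate v_i = Explanatory1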

Now I would like to calculate confidence intervals in R and plot them alongside my model fit in ggplot() (ggplot2 package). The way I would do this for a gls() object is this:

NewData <- data.frame(Explanatory1 = c(...), Explanatory2 = c(...)) 
NewData$fit <- predict(model, newdata = NewData)

Up to this stage everything works fine and I get my model fit.

modmat <-  model.matrix(formula(model)[-2], NewData)
int <- diag(modmat %*% vcov(model) %*% t(modmat))
NewData$lo <- with(NewData, fit - 1.96*sqrt(int))
NewData$hi <- with(NewData, fit + 1.96*sqrt(int))

This part doesn't work with gnls(), presumably because a nonlinear model has no linear design matrix for model.matrix() to construct, so I cannot obtain my upper and lower model predictions.

Since this does not seem to work for gnls() objects, I have consulted textbooks as well as previously asked questions, but none seem to fit my need. The only similar question I found was "How to calculate confidence intervals for Nonlinear Least Squares in r?". In the top answer it was suggested to use either investr::predFit() or to build a model with drc::drm() and then use the regular predict() function. Neither of these solutions helps me with gnls().

My current best solution is to calculate 95% confidence intervals for all three parameters (C, k, A) with the confint() function and then write two separate functions for the upper and lower confidence bounds, i.e. one using Cmin, kmin and Amin and one using Cmax, kmax and Amax. Then I use these functions to predict values that I then plot with ggplot(). However, I am not entirely satisfied with the result and am not sure if this approach is optimal.
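To make that concrete, here is a minimal, untested sketch of the workaround, written in terms of the one-group model mgnls fitted in the reproducible example below:

ci <- confint(mgnls)  # Wald 95% CIs; rows C, k, A, columns 2.5 % and 97.5 %
flo <- function(x) ci["C", 1] * exp(ci["k", 1] * x) + ci["A", 1]
fhi <- function(x) ci["C", 2] * exp(ci["k", 2] * x) + ci["A", 2]
df$lo <- flo(df$x)
df$hi <- fhi(df$x)

Note that this combines the parameter-wise bounds (Cmin with kmin and Amin) and ignores the correlations between the estimates, which is part of why I am not satisfied with it.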

Here is a minimal reproducible example, ignoring the second, categorical explanatory variable for simplicity:

# generate data
set.seed(10)
x <-  rep(1:100,2)
r <- rnorm(x, mean = 10, sd = sqrt(x^-1.3))
y <- exp(-0.05*x) + r
df <-  data.frame(x = x, y = y)

# find starting values
m <- nls(y ~ SSasymp(x, A, C, logk))
summary(m) # A = 9.98071, C = 10.85413, logk = -3.14108
plot(m) # clear heteroskedasticity

# fit generalised nonlinear least squares
require(nlme)
mgnls <- gnls(y ~ C * exp(k * x) + A, 
              start = list(C = 10.85413, k = -exp(-3.14108), A = 9.98071),
              weights = varExp(),
              data = df)
plot(mgnls) # more homogeneous

# plot predicted values 
df$fit <- predict(mgnls)
require(ggplot2)
ggplot(df) +
  geom_point(aes(x, y)) +
  geom_line(aes(x, fit)) +
  theme_minimal()

Edit following Ben Bolker's answer

The standard nonparametric bootstrapping solution applied to a second simulated dataset, which is closer to my original data and includes a second, categorical explanatory variable:

# generate data
set.seed(2)
x <- rep(sample(1:100, 9), 12)
set.seed(15)
r <- rnorm(x, mean = 0, sd = 200*x^-0.8)
y <- c(200, 300) * exp(c(-0.08, -0.05)*x) + c(120, 100) + r
df <-  data.frame(x = x, y = y, 
                  group = rep(letters[1:2], length.out = length(x)))

# find starting values
m <- nls(y ~ SSasymp(x, A, C, logk))
summary(m) # A = 108.9860, C = 356.6851, logk = -2.9356
plot(m) # clear heteroskedasticity

# fit generalised nonlinear least squares
require(nlme)
mgnls <- gnls(y ~ C * exp(k * x) + A, 
              start = list(C = c(356.6851,356.6851), 
                           k = c(-exp(-2.9356),-exp(-2.9356)), 
                           A = c(108.9860,108.9860)),
              params = list(C ~ group, k ~ group, A ~ group),
              weights = varExp(),
              data = df)
plot(mgnls) # more homogeneous

# calculate predicted values 
new <- data.frame(x = c(1:100, 1:100),
                  group = rep(letters[1:2], each = 100))
new$fit <- predict(mgnls, newdata = new)

# calculate bootstrap confidence intervals
bootfun <- function(newdata) {
  start <- coef(mgnls)  # restart each refit from the original estimates
  # resample observations with replacement
  dfboot <- df[sample(nrow(df), size = nrow(df), replace = TRUE), ]
  # refit the model to the resampled data; failed refits yield NA predictions
  bootfit <- try(update(mgnls,
                        start = start,
                        data = dfboot),
                 silent = TRUE)
  if(inherits(bootfit, "try-error")) return(rep(NA, nrow(newdata)))
  predict(bootfit, newdata = newdata)
}

set.seed(10)
bmat <- replicate(500, bootfun(new))
new$lwr <- apply(bmat, 1, quantile, 0.025, na.rm = TRUE)
new$upr <- apply(bmat, 1, quantile, 0.975, na.rm = TRUE)

# plot data and predictions
require(ggplot2)
ggplot() +
  geom_point(data = df, aes(x, y, colour = group)) +
  geom_ribbon(data = new, aes(x = x, ymin = lwr, ymax = upr, fill = group), 
              alpha = 0.3) +
  geom_line(data = new, aes(x, fit, colour = group)) +
  theme_minimal()

This is the resulting plot, which looks neat. However, when I apply the same method to my original data, I get a strange bulge in one of the confidence intervals, despite a similar data distribution in both groups:

Any ideas why this might be happening? Is it possible to get smoother confidence intervals?

Solution

I implemented a bootstrapping solution. I initially did standard nonparametric bootstrapping, which resamples observations, but this produces 95% CIs that look suspiciously wide — I think that this is because that form of bootstrapping fails to maintain the balance in the x-distribution (e.g. by resampling you could end up with no observations for small values of x). (It's also possible that there's just a bug in my code.)
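One way to preserve the x-distribution would be a stratified resampling scheme that draws observations within each (x, group) cell instead of across the whole data set. A rough, untested sketch (not something I pursued here):

dfboot <- do.call(rbind,
                  lapply(split(df, list(df$x, df$group)),
                         function(d) d[sample(nrow(d), replace = TRUE), ]))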

As a second shot I switched to resampling the residuals from the initial fit and adding them to the predicted values; this is a fairly standard approach e.g. in bootstrapping time series (although I'm ignoring the possibility of autocorrelation in the residuals, which would require block bootstrapping).

Here's the basic bootstrap resampler.

df$res <- df$y - df$fit  # residuals of the initial fit
bootfun <- function(newdata=df, perturb=0, boot_res=FALSE) {
    start <- coef(mgnls)
    ## if we start exactly from the previously fitted coefficients we end
    ## up getting all-identical answers? Not sure what's going on here, but
    ## we can fix it by perturbing the starting conditions slightly
    if (perturb>0) {
        start <- start * runif(length(start), 1-perturb, 1+perturb)
    }
    if (!boot_res) {
        ## bootstrap raw data
        dfboot <- df[sample(nrow(df),size=nrow(df), replace=TRUE),]
    } else {
        ## bootstrap residuals
        dfboot <- transform(df,
                            y=fit+sample(res, size=nrow(df), replace=TRUE))
    }
    bootfit <- try(update(mgnls,
                      start = start,
                      data=dfboot),
                   silent=TRUE)
    if (inherits(bootfit, "try-error")) return(rep(NA,nrow(newdata)))
    predict(bootfit,newdata=newdata)
}

set.seed(101)
bmat <- replicate(500,bootfun(perturb=0.1,boot_res=TRUE))   ## resample residuals
bmat2 <- replicate(500,bootfun(perturb=0.1,boot_res=FALSE)) ## resample observations
## construct envelopes (pointwise percentile bootstrap CIs)
df$lwr <- apply(bmat, 1, quantile, 0.025, na.rm=TRUE)
df$upr <- apply(bmat, 1, quantile, 0.975, na.rm=TRUE)
df$lwr2 <- apply(bmat2, 1, quantile, 0.025, na.rm=TRUE)
df$upr2 <- apply(bmat2, 1, quantile, 0.975, na.rm=TRUE)

Now draw the picture:

ggplot(df, aes(x,y)) +
    geom_point() +
    geom_ribbon(aes(ymin=lwr, ymax=upr), colour=NA, alpha=0.3) +
    geom_ribbon(aes(ymin=lwr2, ymax=upr2), fill="red", colour=NA, alpha=0.3) +
    geom_line(aes(y=fit)) +
    theme_minimal()

The pink/light-red region shows the observation-level bootstrap CI (the suspicious one); the gray region shows the residual-bootstrap CI.

It would be nice to try the delta method as well, but (1) it makes stronger assumptions/approximations than bootstrapping anyway, and (2) I'm out of time.
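For completeness, here is roughly what it could look like (untested, and written for the simpler one-group model mgnls of the first reproducible example, where the gradient of C*exp(k*x) + A is easy to write down):

cf <- coef(mgnls)  # named vector: C, k, A
V  <- vcov(mgnls)  # approximate covariance matrix of the estimates
xx <- df$x
## gradient of f(x) = C*exp(k*x) + A with respect to (C, k, A), one row per x
G <- cbind(C = exp(cf["k"] * xx),
           k = cf["C"] * xx * exp(cf["k"] * xx),
           A = rep(1, length(xx)))
se <- sqrt(rowSums((G %*% V) * G))  # diag(G %*% V %*% t(G)) without forming the n-by-n matrix
df$dlo <- df$fit - 1.96 * se
df$dhi <- df$fit + 1.96 * se

These bounds could then be drawn as a third ribbon to compare against the bootstrap envelopes.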
