How to plot logistic glm predicted values and confidence interval in R

Question

I have a binomial glm with a presence/absence response variable and a factor predictor with 9 levels, like this:

data$y<-factor(data$y,levels=c(0,1),labels=c("absent","present"))
table(data$y,data$site_name)

          Andulay Antulang Basak Dauin Poblacion District 1 Guinsuan Kookoo's Nest Lutoban Pier Lutoban South Malatapay Pier
  absent        4        4     1                          0        3             1            5             5              2
  present       2        2     5                          6        3             5            1             1              4

model <- glm(y~site_name,data=data,binomial)

Skipping the model inference and validation for brevity's sake, how do I plot, per site, the probability of getting "present" together with its confidence interval in a boxplot? What I would like is similar to what is shown in "Plot predicted probabilities and confidence intervals in R", but drawn as a boxplot, because my regression variable site_name is a factor with 9 levels, not a continuous variable.

I think I can calculate the necessary values as follows (but I am not 100% sure this is correct):

Function to convert the model coefficients back to probabilities of success:

calc_val <- function(x){return(round(1/(1+1/(exp(x))),3))}

Predicted probabilities based on the model:

prob <- tapply(predict(model,type="response"),data$site_name,function(x){round(mean(x),3)})
means <- as.data.frame(prob)

Confidence intervals for the predicted probabilities (levels 0.9 and 0.5 in the code, i.e. the 5%/95% and 25%/75% bounds):

ci <- cbind(confint(model,level=0.9),confint(model,level=0.5))
rownames(ci) <- gsub("site_name","",rownames(ci))
ci <- t(apply(ci,1,calc_val))

Join it all together in one table:

ci<-cbind(means,ci)
ci
                            prob   5 %  95 %  25 %  75 %   Pr(>|z|) stderr
Andulay                    0.333 0.091 0.663 0.214 0.469 0.42349216  0.192
Antulang                   0.333 0.112 0.888 0.304 0.696 1.00000000  0.192
Basak                      0.833 0.548 0.993 0.802 0.964 0.09916496  0.152
Dauin Poblacion District 1 1.000 0.000    NA 0.000 1.000 0.99097988  0.000
Guinsuan                   0.500 0.223 0.940 0.474 0.819 0.56032414  0.204
Kookoo's Nest              0.833 0.548 0.993 0.802 0.964 0.09916496  0.152
Lutoban Pier               0.167 0.028 0.788 0.130 0.501 0.51171512  0.152
Lutoban South              0.167 0.028 0.788 0.130 0.501 0.51171512  0.152
Malatapay Pier             0.667 0.364 0.972 0.640 0.903 0.25767454  0.192

So my questions are twofold:

  1. Is the calculation of the probabilities and confidence intervals correct?
  2. How do I plot this in a boxplot (box-and-whisker plot)?

EDIT: Here is some sample data via dput (the tables above have also been updated to match this data):

# dput(data[c("y", "site_name")])
data <- structure(list(y = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("absent", "present"), class = "factor"), site_name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 9L, 9L, 9L, 9L, 9L, 9L, 4L, 4L, 4L, 4L, 4L, 4L, 8L, 8L, 8L, 8L, 8L, 8L, 7L, 7L, 7L, 7L, 7L, 7L, 5L, 5L, 5L, 5L, 5L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("Andulay", "Antulang", "Basak", "Dauin Poblacion District 1", "Guinsuan", "Kookoo's Nest", "Lutoban Pier", "Lutoban South", "Malatapay Pier"), class = "factor")), .Names = c("y", "site_name"), row.names = c(125L, 123L, 126L, 124L, 128L, 127L, 154L, 159L, 157L, 158L, 156L, 155L, 111L, 114L, 116L, 115L, 112L, 113L, 152L, 151L, 148L, 150L, 153L, 149L, 143L, 146L, 144L, 147L, 142L, 145L, 164L, 165L, 161L, 163L, 160L, 162L, 120L, 122L, 121L, 117L, 118L, 119L, 137L, 136L, 139L, 141L, 140L, 138L, 129L, 134L, 131L, 135L, 133L, 130L), class = "data.frame")
#

Recommended answer

This is a lowest-common-denominator solution that uses only base packages.

Fit the model:

dd <- data  ## 'dd' is assumed to be the data frame from the question's dput()
mm <- glm(y~site_name,data=dd,family=binomial)

Make up a prediction frame with the site names:

pframe <- data.frame(site_name=unique(dd$site_name))

Predict (on the logit/linear-predictor scale), with standard errors:

pp <- predict(mm,newdata=pframe,se.fit=TRUE)
linkinv <- family(mm)$linkinv ## inverse-link function
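As a quick sanity check on the back-transformation: for a binomial family with the default logit link, the inverse link is the logistic function, which base R also provides as plogis(). The sketch below uses a made-up vector x and a hypothetical name linkinv_demo; it is only meant to illustrate the equivalence, not to replace the step above.

```r
## Inverse link of a default (logit) binomial family
linkinv_demo <- binomial()$linkinv

## Arbitrary values on the logit (linear-predictor) scale, for illustration only
x <- c(-2, 0, 1.5)

## Three ways of writing the same back-transformation to the probability scale
p_family <- linkinv_demo(x)
p_plogis <- plogis(x)
p_manual <- 1 / (1 + exp(-x))

same_as_plogis <- isTRUE(all.equal(p_family, p_plogis))
same_as_manual <- isTRUE(all.equal(p_family, p_manual))
```

The calc_val() function in the question computes the same quantity (plus rounding), so plogis() could be used in its place.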

Put together the prediction, lower and upper bounds, and back-transform to the probability scale:

pframe$pred0 <- pp$fit
pframe$pred <- linkinv(pp$fit)
alpha <- 0.95
sc <- abs(qnorm((1-alpha)/2))  ## Normal approx. to likelihood
alpha2 <- 0.5
sc2 <- abs(qnorm((1-alpha2)/2))  ## Normal approx. to likelihood
pframe <- transform(pframe,
                    lwr=linkinv(pred0-sc*pp$se.fit),
                    upr=linkinv(pred0+sc*pp$se.fit),
                    lwr2=linkinv(pred0-sc2*pp$se.fit),
                    upr2=linkinv(pred0+sc2*pp$se.fit))
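To see what these bounds amount to, here is a small hand check. It assumes a model with a single factor predictor, so that each site's fitted logit is log(k/(n-k)) with standard error sqrt(1/k + 1/(n-k)), where k is the number of "present" observations out of n. The data frame toy and all other names below are hypothetical and not part of the original answer.

```r
## Toy data (hypothetical): site A has 2 "present" out of 6, site B has 5 out of 6
toy <- data.frame(site_name = factor(rep(c("A", "B"), each = 6)),
                  y = c(0, 0, 0, 0, 1, 1,   0, 1, 1, 1, 1, 1))
fit <- glm(y ~ site_name, data = toy, family = binomial)

## Model-based prediction for site A on the logit scale, with standard error
newd <- data.frame(site_name = factor("A", levels = levels(toy$site_name)))
p <- predict(fit, newdata = newd, se.fit = TRUE)

## Closed-form values for a single-factor model
k <- 2; n <- 6
logit_hat <- log(k / (n - k))
se_hat <- sqrt(1 / k + 1 / (n - k))

## 95% Wald interval, back-transformed to the probability scale
z <- qnorm(0.975)
ci_model <- plogis(unname(p$fit) + c(-1, 1) * z * unname(p$se.fit))
ci_hand  <- plogis(logit_hat + c(-1, 1) * z * se_hat)
```

Because the interval is built on the logit scale and then back-transformed, it always stays inside [0, 1], but it is not symmetric around the predicted probability.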

Plot:

with(pframe,
{
    plot(site_name,pred,ylim=c(0,1))
    arrows(as.numeric(site_name),lwr,as.numeric(site_name),upr,
           angle=90,code=3,length=0.1)
})

As a boxplot:

with(pframe,
{
    bxp(list(stats=rbind(lwr,lwr2,pred,upr2,upr),
             n = rep(1,nrow(pframe)),
             conf = NA,
             out = NULL,
             group = NULL,
             names=as.character(site_name)))
})

There are lots of other ways to do this; I would recommend:

library("ggplot2")
ggplot(pframe,aes(site_name,pred))+
     geom_pointrange(aes(ymin=lwr,ymax=upr))+
     geom_linerange(aes(ymin=lwr2,ymax=upr2),lwd=1.5)+
     coord_flip()

An alternative solution is to fit the model via y~site_name-1, which in this case assigns a separate parameter to the probability of each site, and to use profile()/confint() to find the confidence intervals; this will be slightly more accurate than relying on the Normality of the sampling distributions of the parameters/predictions, as done in the answer above.
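A minimal sketch of that alternative follows. It uses a small made-up data set (toy, with hypothetical sites A, B and C) rather than the question's data, and the names m0, ci_logit and res are illustrative only. In recent R versions confint() on a glm computes profile-likelihood intervals directly; older versions delegate to the MASS package, which is assumed to be installed.

```r
## Toy data (hypothetical): three sites, six observations each
toy <- data.frame(
    site_name = factor(rep(c("A", "B", "C"), each = 6)),
    y = c(0, 0, 0, 0, 1, 1,   0, 0, 0, 1, 1, 1,   0, 1, 1, 1, 1, 1)
)

## No-intercept fit: one coefficient per site, each the logit of that site's probability
m0 <- glm(y ~ site_name - 1, data = toy, family = binomial)

## Profile-likelihood intervals on the logit scale (messages about profiling suppressed)
ci_logit <- suppressMessages(confint(m0, level = 0.95))

## Back-transform estimates and bounds to the probability scale
res <- data.frame(site_name = levels(toy$site_name),
                  pred = plogis(unname(coef(m0))),
                  lwr  = plogis(unname(ci_logit[, 1])),
                  upr  = plogis(unname(ci_logit[, 2])))
```

One caveat: for a site where every observation is "present" (or every one is "absent"), the logit estimate is infinite, so the profile interval for that site is likely to be NA or unreliable, as with the Dauin row in the question's table.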
