No zeros predicted from zeroinfl object in R?


Question

I fitted a zero-inflated negative binomial model and want to investigate how many of the zeros were partitioned into sampling versus structural zeros. How do I do this in R? The example code on the zeroinfl help page is not clear to me.

library(pscl)
data("bioChemists", package = "pscl")

fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")

table(round(predict(fm_zinb2, type = "zero")))
#>   0   1
#> 891  24

table(round(bioChemists$art))
#>   0   1   2   3   4   5   6   7   8   9  10  11  12  16  19
#> 275 246 178  84  67  27  17  12   1   2   1   1   2   1   1

What is this telling me?

When I do the same for my data, I get a readout that just has the sample size listed under the 1. Thanks.

Recommended answer

The details are in the paper by Zeileis et al. (2008), available at https://www.jstatsoft.org/article/view/v027i08/v27i08.pdf

It takes a little work (a couple of years on, your question was still unanswered) to gather together the explanations of what the predict function does for each model in the pscl package, because they are buried (pp. 19, 23) in the mathematical expressions of the likelihood function (equations 7, 8). I have interpreted your question to mean that you want to know how to use the different types of predict:

  • What is the expected count? (type="response")
  • What is the (conditional) expected probability of an excess zero? (type="zero")
  • What is the (marginal) expected probability of each count? (type="prob")
  • And finally, how many of the predicted zeros are excess zeros (from the point mass at zero, usually called structural zeros) rather than zeros arising from the count regression (usually called sampling zeros)?
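To make the relationship between these prediction types concrete, here is a minimal base-R sketch for a single observation. The parameter values (`pi_excess`, `mu_count`, `theta`) are made up for illustration and are not taken from the fitted model; the formulas are the standard zero-inflated negative binomial ones, which is what I understand `predict` to be reporting.

```r
# Hypothetical parameters for one observation (illustrative values only)
pi_excess <- 0.3   # probability of an excess zero, the quantity type = "zero" reports
mu_count  <- 2.0   # mean of the negative binomial count component
theta     <- 1.5   # negative binomial dispersion (size) parameter

counts <- 0:200    # wide enough that the omitted tail is negligible

# Marginal probability of each count, the quantity type = "prob" reports
p_count <- (1 - pi_excess) * dnbinom(counts, size = theta, mu = mu_count)
p_count[1] <- p_count[1] + pi_excess   # the point mass adds to P(Y = 0)

# Expected count, the quantity type = "response" reports
expected_count <- (1 - pi_excess) * mu_count

print(round(p_count[1:4], 4))
print(expected_count)
```

The probability of a zero therefore has two parts: `pi_excess` from the point mass, plus a contribution from the count component. That split is what the rest of this answer works through.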

To load the pscl package and read in the data that comes with it:

library(pscl)
data("bioChemists", package = "pscl")

Then fit a zero-inflated negative binomial model:

fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")

If you wish to predict the expected values, use

predict(fm_zinb2, type="response")[29:31]
       29        30        31 
0.5213736 1.7774268 0.5136430

So under this model, the expected number of articles published in the last 3 years of a PhD is about one half for biochemists 29 and 31 and nearly 2 for biochemist 30.

But I believe that you are after the probability of an excess zero (the point mass at zero). This command computes it and prints the values for rows 29 to 31 (yes, I went fishing for these rows!):

predict(fm_zinb2, type="zero")[29:31]

It produces the following output:

        29         30         31 
0.58120120 0.01182628 0.58761308 

So the probability that the 29th biochemist is an excess zero is 58%, for the 30th it is 1.2%, and for the 31st it is 59%. (You ask about sampling versus structural zeros. In the usual terminology, an excess zero from the point mass is a structural zero, and a zero produced by the negative binomial count component is a sampling zero.) So biochemists 29 and 31 are more likely than not to be excess zeros, over and above the zeros that the negative binomial regression on the covariates would produce.

And you have tabulated these predicted probabilities across the whole dataset:

table(round(predict(fm_zinb2, type="zero"))) 
  0   1 
891  24

So your output tells you that only 24 biochemists were likely to be an excess zero, i.e. had a predicted probability of an excess zero above one half (rounding maps those probabilities to 1).
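Rounding is just a threshold at one half, which a small sketch makes explicit. The probabilities below are made up for illustration and are not from the fitted model. If your own table lists the whole sample under 1, it means every predicted probability of an excess zero is above 0.5, so it is worth looking at the distribution of the probabilities rather than the rounded values.

```r
# Hypothetical predicted probabilities of an excess zero (not from the model above)
p_excess <- c(0.02, 0.48, 0.51, 0.58, 0.07, 0.93)

# round() maps values above 0.5 to 1 and values below 0.5 to 0
rounded_counts <- table(round(p_excess))

# An explicit threshold states the same rule more clearly
n_likely_excess <- sum(p_excess > 0.5)

# The distribution of the probabilities is more informative than either count
prob_summary <- summary(p_excess)

print(rounded_counts)
print(n_likely_excess)
print(prob_summary)
```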

It would perhaps be easier to interpret if you tabulated the probabilities into bins 10 percentage points wide:

table(cut(predict(fm_zinb2, type="zero"), breaks=seq(from=0,to=1,by=0.1))) 

giving

 (0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] 
     751        73        34        23        10        22 
(0.6,0.7] (0.7,0.8] (0.8,0.9]   (0.9,1] 
        2         0         0         0

So you can see that 751 biochemists were unlikely to be an excess zero (probability of at most 10%), 22 biochemists have a chance of between 50% and 60% of being an excess zero, and only 2 have a higher chance (60-70%). No one was extremely likely to be an excess zero. Graphically, this can be shown in a histogram:

hist(predict(fm_zinb2, type="zero"), col="slateblue", breaks=seq(0,0.7,by=.02))

You also tabulated the actual number of publications per biochemist (no rounding necessary, as these are counts):

table(bioChemists$art)
  0   1   2   3   4   5   6   7   8   9  10  11  12  16  19 
275 246 178  84  67  27  17  12   1   2   1   1   2   1   1

Who is the special biochemist with 19 publications?

most_pubs <- max(bioChemists$art)
most_pubs
extreme_biochemist <- bioChemists$art==most_pubs
which(extreme_biochemist)

You can obtain the estimated probability that each biochemist has any given number of publications, from exactly 0 up to the maximum, here an unbelievable 19!

preds <- predict(fm_zinb2, type="prob")
preds[extreme_biochemist,]

You can then look at these probabilities for our one special biochemist, who had 19 publications (using base R plotting here, but ggplot is more beautiful):

expected <- predict(fm_zinb2, type="response")[extreme_biochemist]
# barplot returns the midpoints for counts 0 up to 19
midpoints<-barplot(preds[extreme_biochemist,], 
  xlab="Predicted #pubs", ylab="Relative chance among biochemists")
# add 1 because the first count is 0
abline(v=midpoints[19+1],col="red",lwd=3)
abline(v=midpoints[round(expected)+1],col="yellow",lwd=3)

This shows that although we expect 4.73 publications for biochemist 915 (the yellow line marks the rounded expected count), under this model the most probability is given to 2-3 publications, nowhere near the actual 19 (red line).

Getting back to the question: for biochemist 29, the probability of an excess zero is

pzero <- predict(fm_zinb2, type="zero")
pzero[29]
       29 
0.5812012 

The overall (marginal) probability of a zero is

preds[29,1]
[1] 0.7320871

So the proportion of the predicted probability of a zero that comes from the excess-zero component, rather than from the count regression, is:

pzero[29]/preds[29,1]
       29 
0.7938962

And the additional probability of a zero beyond the chance of an excess zero, that is, the probability of a zero arising from the negative binomial count component, is:

preds[29,1] - pzero[29]

       29 
0.1508859

The actual number of publications for biochemist 29 is

bioChemists$art[29]
[1] 0

So only about 21% of this biochemist's predicted chance of zero publications is attributable to the count regression. Most of it, about 79%, is attributable to the excess-zero component.
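The arithmetic for biochemist 29 can be reproduced from the two numbers printed above, without refitting the model. This is only a sketch; the variable names are my own.

```r
# Values reported above for biochemist 29
p_excess_zero <- 0.5812012   # predict(fm_zinb2, type = "zero")[29]
p_any_zero    <- 0.7320871   # predict(fm_zinb2, type = "prob")[29, 1]

# Probability of a zero produced by the negative binomial count component
p_count_zero <- p_any_zero - p_excess_zero

# Shares of the total zero probability
share_excess <- p_excess_zero / p_any_zero
share_count  <- p_count_zero / p_any_zero

print(round(c(excess = share_excess, count = share_count), 3))
```

Under the model, `share_excess` can also be read as the probability that an observed zero for this biochemist came from the excess-zero component rather than from the count component.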

Overall, we see that for most biochemists this is not the case. Our biochemist 29 is unusual, since their chance of zero publications is mostly excess, i.e. not explained by the count regression. We can see this via:

hist(pzero/preds[,1], col="blue", xlab="Proportion of predicted probability of zero that is excess")
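To answer the original "how many zeros" question for a whole dataset, the same quantities can be summed over observations. The helper below is hypothetical (it is not part of pscl) and is only a sketch; it assumes you pass it the vector from type="zero", the first column of the type="prob" matrix, and the observed counts.

```r
# Hypothetical helper (not part of pscl): summarise how predicted zeros split
# between the excess-zero component and the count component.
partition_zeros <- function(p_excess, p_zero, observed) {
  stopifnot(length(p_excess) == length(p_zero),
            length(p_zero) == length(observed),
            all(p_excess <= p_zero + 1e-12))
  is_zero <- observed == 0
  c(
    observed_zeros        = sum(is_zero),
    expected_zeros        = sum(p_zero),
    expected_excess_zeros = sum(p_excess),
    expected_count_zeros  = sum(p_zero - p_excess),
    # Among observations that really are zero, p_excess / p_zero is the
    # model's probability that the zero came from the excess component
    excess_among_observed = sum((p_excess / p_zero)[is_zero])
  )
}

# Small made-up example, not the bioChemists data
demo <- partition_zeros(
  p_excess = c(0.58, 0.01, 0.59, 0.10),
  p_zero   = c(0.73, 0.20, 0.74, 0.30),
  observed = c(0, 3, 0, 1)
)
print(round(demo, 3))
```

With the fitted model from this answer, the call would be `partition_zeros(pzero, preds[, 1], bioChemists$art)`, and `observed_zeros` can then be compared with the 275 zeros in the table of actual counts.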
