Making a zero-inflated or hurdle model with R

Problem description

I need to build a model that estimates the probability that a registered user will buy a paid plan or not (i.e., will use only the free plan or do nothing) and, if they do buy, after how much time. I have about 13 000 rows of data; about 12 000 of them are free users (never paid, value 0) and the other 1 000 paid after some time (from 1 to 690 days). I also have some count and categorical data: country, number of the user's clients, how many times the user used a plan, and plan (premium, free, premium plus).

With the zeros included, the mean of the time variable is about 6.37 and the variance is 1801.17; without the zeros, the mean is about 100 and the variance 19012. Since the variance is far larger than the mean in both cases, this suggests to me that I should use a negative binomial model.

But I'm not sure which model fits best; I'm thinking about a zero-inflated negative binomial or hurdle model.

Here are histograms of diff.time with and without the zeros: [histograms of diff.time, with zeros and without zeros]

I tried these models with the pscl package:

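library(pscl)  # provides zeroinfl() and hurdle()

# Same covariates in the count part (left of |) and the zero part (right of |)
m1 <- zeroinfl(
  diff.time3 ~ factor(Registration.country) + factor(Plan) +
    Campaigns.sent + Number.of.subscribers |
    factor(Registration.country) + factor(Plan) +
    Campaigns.sent + Number.of.subscribers,
  data = df, dist = "negbin", link = "logit"
)
summary(m1)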

or the same with hurdle(), but both gave me an error. With zeroinfl():

Error in quantile.default(x$residuals) : missing values and NaN's not allowed if 'na.rm' is FALSE
In addition: Warning message:
glm.fit: algorithm did not converge

With hurdle():

Error in solve.default(as.matrix(fit_count$hessian)) : Lapack routine dgesv: system is exactly singular: U[3,3] = 0

I have never tried these models before so I'm not sure how to fix these errors or if I chose the right models.

Unfortunately, I cannot share any part of my data, but I'll try to describe it:

1st column "Plan": most of the data are "Free" (around 12 000); the other values are "Earning more", "Premium" and "Premium trial", where "Free" and "Premium trial" are not paid.

2nd column "Plan used": around 8 000 rows are 0, 1 000 are 1, 3 000 are from 1 to 10, and another 1 000 are from 10 to 510.

3rd column "Clients": how many clients the user has. Around 2 000 have 0, 4 000 have 1-10, 3 000 have 10-200, 2 000 have 200-1000, and 2 000 have 1000-340 000.

4th column "Registration country": 36 different countries. Over half of the data is United States; the others have from 5 to a few hundred rows.

5th column "diff.time": this should be my dependent variable. As I said before, most of the data are 0 (12 000) and the others vary from 1 day to 690 days.

Recommended answer

If your actual data is similarly structured to the data you posted then you will have problems estimating a model like the one you specified. Let's first have a look at the data you posted on the Google drive:

load("duom.Rdata")
table(a$diff.time3 > 0)
## FALSE  TRUE 
##   950    50 

Thus there is some variation in the response, but not a lot. You have only 5% non-zeros, 50 observations overall. From this information alone it seems more reasonable to fit a bias-reduced binary model (brglm) for the hurdle part (zero vs. non-zero).
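As a rough illustration of the hurdle part, the sketch below fits a binary model for zero vs. non-zero on simulated data. The column names follow the question, but the values, the `paid` indicator and the coefficients used for simulation are made up. The ordinary `glm()` fit always runs; the bias-reduced fit is attempted only if the brglm package happens to be installed.

```r
# Simulated stand-in for the real data: values are invented for illustration.
set.seed(1)
n <- 1000
a <- data.frame(
  Campaigns.sent = rpois(n, 2),
  Number.of.subscribers = rpois(n, 50)
)

# Roughly 5% payers, slightly more likely for users who send more campaigns
p_pay <- plogis(-4 + 0.4 * a$Campaigns.sent)
pays <- rbinom(n, 1, p_pay)
a$diff.time3 <- ifelse(pays == 1, 1 + rnbinom(n, mu = 100, size = 0.5), 0)

# Hurdle part: did the user ever pay (diff.time3 > 0)?
a$paid <- as.integer(a$diff.time3 > 0)   # hypothetical helper column
f <- paid ~ Campaigns.sent + Number.of.subscribers
m_bin <- glm(f, data = a, family = binomial)
print(coef(m_bin))

# Bias-reduced version of the same model, only if brglm is available
if (requireNamespace("brglm", quietly = TRUE)) {
  m_br <- brglm::brglm(f, data = a, family = binomial)
  print(coef(m_br))
}
```

With so few non-zero responses, comparing the two sets of coefficients gives a quick impression of how much the bias reduction matters for a given specification.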

For the zero-truncated count part you can possibly fit a model but you need to be careful which effects you want to include because there are only 50 degrees of freedom. You can estimate the zero-truncated part of the hurdle model using the zerotrunc function in package countreg, available from R-Forge.
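To make concrete what the zero-truncated count part estimates, here is a base-R sketch that maximizes a zero-truncated negative binomial likelihood by hand with `optim()`. This is not the `zerotrunc` function itself, only an illustration of the same idea; the data are simulated, and a larger sample than the 50 non-zero rows is used so the toy fit is stable.

```r
# Simulated counts (invented values; only the structure matters)
set.seed(42)
n <- 2000
x <- runif(n)
y_all <- rnbinom(n, mu = exp(1 + 0.5 * x), size = 2)
keep <- y_all > 0            # the truncated part only sees non-zero counts
y <- y_all[keep]
X <- cbind(1, x[keep])       # intercept + one covariate

# Negative log-likelihood of a zero-truncated negative binomial:
# log f(y | y > 0) = log f(y) - log(1 - f(0))
negll <- function(par) {
  mu <- exp(drop(X %*% par[1:2]))
  theta <- exp(par[3])       # dispersion kept positive via the log scale
  -sum(dnbinom(y, mu = mu, size = theta, log = TRUE) -
         pnbinom(0, mu = mu, size = theta, lower.tail = FALSE, log.p = TRUE))
}

fit <- optim(c(log(mean(y)), 0, 0), negll, method = "BFGS")
est <- c(intercept = fit$par[1], slope = fit$par[2], theta = exp(fit$par[3]))
print(round(est, 2))
```

Each coefficient added to `X` uses up one of the few degrees of freedom available, which is why the choice of effects matters so much with only 50 positive observations.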

Also you should clean up your factors. By re-applying the factor function within the formula, levels with zero occurrences are excluded. But there are also levels with only one occurrence for which you will not get meaningful results.

table(factor(a$Plan))
## Earning much more              Free           Mailing           Premium 
##                 1               950                 1                24 
##     Premium trial 
##                24 
table(factor(a$Registration.country))
##  australia  Australia    Austria Bangladesh    Belgium     brasil     Brasil 
##          1        567          7          5         56          1         53 
##   Bulgaria     Canada 
##         10        300 

Also, you need to clean up the country levels that are written in all lower-case letters (e.g., "australia" vs. "Australia"), so that they are not treated as separate countries.
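One possible way to do this clean-up is sketched below on a toy vector that mimics the labels in the table above. Lumping rare levels into "Other" and the threshold `min_n` are assumptions made for illustration; dropping those rows or merging them by region would be equally valid choices.

```r
# Toy vector mimicking the messy country labels (not the real data)
country <- c("australia", "Australia", "Australia", "brasil", "Brasil",
             "Brasil", "Canada", "Canada", "Austria")

# Normalise case: lower-case everything, then capitalise the first letter
norm <- tolower(trimws(country))
norm <- paste0(toupper(substring(norm, 1, 1)), substring(norm, 2))

# Lump levels with fewer than min_n rows into "Other" (min_n is arbitrary)
min_n <- 2
counts <- table(norm)
rare <- names(counts)[counts < min_n]
norm[norm %in% rare] <- "Other"

country_clean <- factor(norm)
print(table(country_clean))
```

The same two steps can be applied to the Plan column, where "Earning much more" and "Mailing" each occur only once.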

After that I would start by building a binary GLM for zero vs. non-zero and, based on those results, continue with the zero-truncated count part.
