用户指定族的glm回归的模型公式 [英] Model formula for glm regression with user-specified family

查看:62
本文介绍了用户指定族的glm回归的模型公式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

背景

我正在尝试预测某个产品系列的销售额(末尾示例中的y_test).其一段时间内的销售额是基于另一种产品(x_test)的所有先前销售额以及这些先前的销售额中有多少仍在使用中.尽管无法直接测量仍在使用中的那些先前已售出产品的数量,所以需要推断出生存曲线.

I am trying to predict the sales for a product line (y_test in the sample at the end). Its sales for a timeperiod are based on all previous sales of another product (x_test) and how many of those previous sales are still being used. It is not possible to directly measure the number of those previously-sold products that are still in use though, so a survival curve needs to be inferred.

例如,如果您制造特定型号智能手机的配件,则配件销售至少部分取决于仍在使用的那些智能手机的数量.(这不是功课,顺便说一句.)

For instance, if you make accessories for a particular smartphone model, the accessory sales are at least partly based on the number of those smartphones still being used. (This is not homework, BTW.)

详细信息

我有一些时间序列数据,想使用 glm 或类似的方法拟合回归模型.因变量和自变量之间的关系是这样的:

I have some time series data and would like to fit a regression model using glm or something similar. The relationship between the dependent and independent variable is this:

其中p是时间段,y p 是因变量,x p 是自变量,c 0 和c 1 是回归系数,F t 是累积分布函数(例如 pgamma ),e p 是残留物.

Where p is the time period, yp is the dependent variable, xp is the independent variable, c0 and c1 are the regression coefficients, Ft is a cumulative distribution function (such as pgamma), and ep are the residuals.

在前三个时间段内,该功能将扩展为以下形式:

Through the first three time periods, the function would expand to something like this:

#y[1] = c0 + c1*(x[1]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 0, 1)$value))
#y[2] = c0 + c1*(x[1]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 1, 2)$value) + x[2]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 0, 1)$value))
#y[3] = c0 + c1*(x[1]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 2, 3)$value) + x[2]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 1, 2)$value) + x[3]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 0, 1)$value))

所以,我有x p 和y p 的历史数据,我想获取系数/参数c 0的值,c 1 ,c 2 和c 3 ,可最大程度地减少残差.

So, I have historical data for xp and yp, and I want to get the values for the coefficients/parameters c0, c1, c2, and c3 that minimize the residuals.

我认为解决方案是使用 glm 并创建一个自定义系列,但是我不确定该怎么做.code> Gamma 系列,但距离还很远.我已经可以使用 nlminb 手动"进行优化,但是我更喜欢 glm 或类似功能.

I think the solution is to use glm and create a custom family, but I'm not sure how to do that. I looked at the code for the Gamma family but didn't get very far. I have been able to do the optimization "manually" using nlminb, but I would much prefer the simplicity and usefulness (i.e. predict and others) offered by glm or similar functions.

以下是一些示例数据:

# Survival function (the integral part):
fsurv<-function(q, par) {
  l<-length(q)
  out<-vapply(1:l, function(i) {1-integrate(function(x) {pgamma(x, par[1], par[1]/par[2])}, q[i]-1, q[i])$value}, FUN.VALUE=0)
  return(out)}

# Sum up the products:
frevsumprod <- function(x,y) {
  l <- length(y)
  out <- vapply(1:l, function(i) sum(x[1:i]*rev(y[1:i])), FUN.VALUE=0)
  return(out)}

# Sample data:
p<-1:24 # Number of periods
x_test<-c(1188, 2742, 4132) # Sample data
y_test<-c(82520, 308910, 749395, 801905, 852310, 713935, 624170, 603960, 640660, 553600, 497775, 444140) # Sample data
c<-c(-50.161147,128.787437,0.817085,13.845487) # Coefficients and parameters, from another method that fit the data

# Pad the data to the correct length:
pad<-function(p,v,padval=0) {
  l<-length(p)
  padv<-l-length(v)
  if(padv>0) (v<-c(v,rep(padval,padv)))
  return(v)
}
x_test<-pad(p,x_test)
y_test<-pad(p,y_test,NA)

y_fitted<-c[0+1]+c[1+1]*frevsumprod(x_test,fsurv(p,c[(2:3)+1])) # Fitted values from regression

library(ggplot2)
ggplot(data.frame(p,y_test,y_fitted))+geom_point(aes(p,y_test))+geom_line(aes(p,y_fitted)) # Plot actual and fit

推荐答案

这不能通过 glm 完成. glm 中的 family 指定线性预测变量与y平均值的连接方式.请参见?family Wiki .特别是,您将需要能够使用以下某些功能编写一个 family 列表:

This cannot be done with glm. The family in glm specifies how the linear predictor is connected with the mean value of y. See ?family and wiki. In particular, you would need to be able to write a family list with (some of) the functions like:

> fam <- poisson()
> str(fam)
List of 12
 $ family    : chr "poisson"
 $ link      : chr "log"
 $ linkfun   :function (mu)  
 $ linkinv   :function (eta)  
 $ variance  :function (mu)  
 $ dev.resids:function (y, mu, wt)  
 $ aic       :function (y, n, mu, wt, dev)  
 $ mu.eta    :function (eta)  
 $ initialize:  expression({  if (any(y < 0))  stop("negative values not allowed for the 'Poisson' family")  n <- rep.int(1, nobs| __truncated__
 $ validmu   :function (mu)  
 $ valideta  :function (eta)  
 $ simulate  :function (object, nsim)  
 - attr(*, "class")= chr "family"
> 
> fam <- Gamma()
> str(fam)
List of 12
 $ family    : chr "Gamma"
 $ link      : chr "inverse"
 $ linkfun   :function (mu)  
 $ linkinv   :function (eta)  
 $ variance  :function (mu)  
 $ dev.resids:function (y, mu, wt)  
 $ aic       :function (y, n, mu, wt, dev)  
 $ mu.eta    :function (eta)  
 $ initialize:  expression({  if (any(y <= 0))  stop("non-positive values not allowed for the 'gamma' family")  n <- rep.int(1, n| __truncated__
 $ validmu   :function (mu)  
 $ valideta  :function (eta)  
 $ simulate  :function (object, nsim)  
 - attr(*, "class")= chr "family"

其中 eta 是线性预测变量.IE.至少需要指定一个反向链接函数 linkinv ,该函数仅通过参数和协变量之间的点积依赖于协变量.您不会那样,因为它以非线性方式依赖于c_2和c_3.

where eta refers to the linear predictor. I.e. at-least you will need to specify an inverse link function, linkinv, which only depends on the co-variates through a dot product between the parameters and the co-variates. Yours does not as it depends on c_2 and c_3 in a non-linear way.

这篇关于用户指定族的glm回归的模型公式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆