用户指定族的glm回归的模型公式 [英] Model formula for glm regression with user-specified family
问题描述
背景
我正在尝试预测某个产品系列的销售额(末尾示例中的y_test).其一段时间内的销售额是基于另一种产品(x_test)的所有先前销售额以及这些先前的销售额中有多少仍在使用中.尽管无法直接测量仍在使用中的那些先前已售出产品的数量,所以需要推断出生存曲线.
I am trying to predict the sales for a product line (y_test in the sample at the end). Its sales for a timeperiod are based on all previous sales of another product (x_test) and how many of those previous sales are still being used. It is not possible to directly measure the number of those previously-sold products that are still in use though, so a survival curve needs to be inferred.
例如,如果您制造特定型号智能手机的配件,则配件销售至少部分取决于仍在使用的那些智能手机的数量.(这不是功课,顺便说一句.)
For instance, if you make accessories for a particular smartphone model, the accessory sales are at least partly based on the number of those smartphones still being used. (This is not homework, BTW.)
详细信息
我有一些时间序列数据,想使用 glm
或类似的方法拟合回归模型.因变量和自变量之间的关系是这样的:
I have some time series data and would like to fit a regression model using glm
or something similar. The relationship between the dependent and independent variable is this:
其中p是时间段,y p 是因变量,x p 是自变量,c 0 和c 1 是回归系数,F t 是累积分布函数(例如 pgamma
),e p 是残留物.
Where p is the time period, yp is the dependent variable, xp is the independent variable, c0 and c1 are the regression coefficients, Ft is a cumulative distribution function (such as pgamma
), and ep are the residuals.
在前三个时间段内,该功能将扩展为以下形式:
Through the first three time periods, the function would expand to something like this:
#y[1] = c0 + c1*(x[1]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 0, 1)$value))
#y[2] = c0 + c1*(x[1]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 1, 2)$value) + x[2]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 0, 1)$value))
#y[3] = c0 + c1*(x[1]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 2, 3)$value) + x[2]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 1, 2)$value) + x[3]*(1-integrate(function(q) {pgamma(q, c2, c2/c3)}, 0, 1)$value))
所以,我有x p 和y p 的历史数据,我想获取系数/参数c 0的值,c 1 ,c 2 和c 3 ,可最大程度地减少残差.
So, I have historical data for xp and yp, and I want to get the values for the coefficients/parameters c0, c1, c2, and c3 that minimize the residuals.
我认为解决方案是使用 glm
并创建一个自定义系列,但是我不确定该怎么做.code> Gamma 系列,但距离还很远.我已经可以使用 nlminb
手动"进行优化,但是我更喜欢 glm
或类似功能.
I think the solution is to use glm
and create a custom family, but I'm not sure how to do that. I looked at the code for the Gamma
family but didn't get very far. I have been able to do the optimization "manually" using nlminb
, but I would much prefer the simplicity and usefulness (i.e. predict
and others) offered by glm
or similar functions.
以下是一些示例数据:
# Survival function (the integral part):
fsurv<-function(q, par) {
l<-length(q)
out<-vapply(1:l, function(i) {1-integrate(function(x) {pgamma(x, par[1], par[1]/par[2])}, q[i]-1, q[i])$value}, FUN.VALUE=0)
return(out)}
# Sum up the products:
frevsumprod <- function(x,y) {
l <- length(y)
out <- vapply(1:l, function(i) sum(x[1:i]*rev(y[1:i])), FUN.VALUE=0)
return(out)}
# Sample data:
p<-1:24 # Number of periods
x_test<-c(1188, 2742, 4132) # Sample data
y_test<-c(82520, 308910, 749395, 801905, 852310, 713935, 624170, 603960, 640660, 553600, 497775, 444140) # Sample data
c<-c(-50.161147,128.787437,0.817085,13.845487) # Coefficients and parameters, from another method that fit the data
# Pad the data to the correct length:
pad<-function(p,v,padval=0) {
l<-length(p)
padv<-l-length(v)
if(padv>0) (v<-c(v,rep(padval,padv)))
return(v)
}
x_test<-pad(p,x_test)
y_test<-pad(p,y_test,NA)
y_fitted<-c[0+1]+c[1+1]*frevsumprod(x_test,fsurv(p,c[(2:3)+1])) # Fitted values from regression
library(ggplot2)
ggplot(data.frame(p,y_test,y_fitted))+geom_point(aes(p,y_test))+geom_line(aes(p,y_fitted)) # Plot actual and fit
推荐答案
这不能通过 glm
完成. glm
中的 family
指定线性预测变量与y平均值的连接方式.请参见?family
和 Wiki .特别是,您将需要能够使用以下某些功能编写一个 family
列表:
This cannot be done with glm
. The family
in glm
specifies how the linear predictor is connected with the mean value of y. See ?family
and wiki. In particular, you would need to be able to write a family
list with (some of) the functions like:
> fam <- poisson()
> str(fam)
List of 12
$ family : chr "poisson"
$ link : chr "log"
$ linkfun :function (mu)
$ linkinv :function (eta)
$ variance :function (mu)
$ dev.resids:function (y, mu, wt)
$ aic :function (y, n, mu, wt, dev)
$ mu.eta :function (eta)
$ initialize: expression({ if (any(y < 0)) stop("negative values not allowed for the 'Poisson' family") n <- rep.int(1, nobs| __truncated__
$ validmu :function (mu)
$ valideta :function (eta)
$ simulate :function (object, nsim)
- attr(*, "class")= chr "family"
>
> fam <- Gamma()
> str(fam)
List of 12
$ family : chr "Gamma"
$ link : chr "inverse"
$ linkfun :function (mu)
$ linkinv :function (eta)
$ variance :function (mu)
$ dev.resids:function (y, mu, wt)
$ aic :function (y, n, mu, wt, dev)
$ mu.eta :function (eta)
$ initialize: expression({ if (any(y <= 0)) stop("non-positive values not allowed for the 'gamma' family") n <- rep.int(1, n| __truncated__
$ validmu :function (mu)
$ valideta :function (eta)
$ simulate :function (object, nsim)
- attr(*, "class")= chr "family"
其中 eta
是线性预测变量.IE.至少需要指定一个反向链接函数 linkinv
,该函数仅通过参数和协变量之间的点积依赖于协变量.您不会那样,因为它以非线性方式依赖于c_2和c_3.
where eta
refers to the linear predictor. I.e. at-least you will need to specify an inverse link function, linkinv
, which only depends on the co-variates through a dot product between the parameters and the co-variates. Yours does not as it depends on c_2 and c_3 in a non-linear way.
这篇关于用户指定族的glm回归的模型公式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!