XGBoost - Poisson distribution with varying exposure / offset

Question

I am trying to use XGBoost to model claims frequency of data generated from unequal length exposure periods, but have been unable to get the model to treat the exposure correctly. I would normally do this by setting log(exposure) as an offset - are you able to do this in XGBoost?

(A similar question was posted here: xgboost, offset exposure?)

To illustrate the issue, the R code below generates some data with the fields:

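  • x1, x2 - factors (either 0 or 1)
  • exposure - length of the policy period over which the data was observed
  • frequency - mean number of claims per unit of exposure
  • claims - number of observed claims ~ Poisson(frequency * exposure)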

The goal is to predict frequency using x1 and x2 - the true model is: frequency = 2 if x1 = x2 = 1, frequency = 1 otherwise.

Exposure can't be used to predict the frequency as it is not known at the outset of a policy. The only way we can use it is to say: expected number of claims = frequency * exposure.
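Written out, this is the usual log-link Poisson setup. In the sketch below, $f(x_1, x_2)$ is a label introduced here for the model's score on the log scale; it does not appear in the original code.

```latex
\mathbb{E}[\text{claims}] = \text{frequency} \times \text{exposure}
\quad\Longrightarrow\quad
\log \mathbb{E}[\text{claims}]
  = \underbrace{\log(\text{frequency})}_{f(x_1,\, x_2)}
  + \underbrace{\log(\text{exposure})}_{\text{offset}}
```

The offset enters with a fixed coefficient of 1, so the model only has to learn $f$.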

The code tries to predict this using XGBoost by:

  1. setting exposure as a weight in the model matrix
  2. setting log(exposure) as an offset

Below these, I've shown how I would handle the situation for a tree (rpart) or gbm.

set.seed(1)
size<-10000
d <- data.frame(
  x1 = sample(c(0,1),size,replace=T,prob=c(0.5,0.5)),
  x2 = sample(c(0,1),size,replace=T,prob=c(0.5,0.5)),
  exposure = runif(size, 1, 10)*0.3
)
d$frequency <- 2^(d$x1==1 & d$x2==1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)

#### Try to fit using XGBoost
require(xgboost)
param0 <- list(
  "objective"  = "count:poisson"
  , "eval_metric" = "logloss"
  , "eta" = 1
  , "subsample" = 1
  , "colsample_bytree" = 1
  , "min_child_weight" = 1
  , "max_depth" = 2
)

## 1 - set weight in xgb.Matrix

xgtrain = xgb.DMatrix(as.matrix(d[,c("x1","x2")]), label = d$claims, weight = d$exposure)
xgb = xgb.train(
  nrounds = 1
  , params = param0
  , data = xgtrain
)

d$XGB_P_1 <- predict(xgb, xgtrain)

## 2 - set as offset in xgb.Matrix
xgtrain.mf  <- model.frame(as.formula("claims~x1+x2+offset(log(exposure))"),d)
xgtrain.m  <- model.matrix(attr(xgtrain.mf,"terms"),data = d)
xgtrain  <- xgb.DMatrix(xgtrain.m,label = d$claims)

xgb = xgb.train(
  nrounds = 1
  , params = param0
  , data = xgtrain
)

d$XGB_P_2 <- predict(xgb, xgtrain)

#### Fit a tree
require(rpart)
d[,"tree_response"] <- cbind(d$exposure,d$claims)
tree <- rpart(tree_response ~ x1 + x2,
              data = d,
              method = "poisson")

d$Tree_F <- predict(tree, newdata = d)

#### Fit a GBM
require(gbm)
gbm <- gbm(claims~x1+x2+offset(log(exposure)), 
           data = d,
           distribution = "poisson",
           n.trees = 1,
           shrinkage=1,
           interaction.depth=2,
           bag.fraction = 0.5)

d$GBM_F <- predict(gbm, newdata = d, n.trees = 1, type="response")

Recommended answer

I have now worked out how to do this: use setinfo to set the base_margin attribute of the DMatrix to the offset (on the linear-predictor scale), i.e.:

setinfo(xgtrain, "base_margin", log(d$exposure))
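For context, here is a minimal end-to-end sketch of how that one line might fit into the question's setup. It regenerates the simulated data from the question; the values of eta and nrounds, and the names X, fit, xgfreq, pred_claims and pred_freq, are illustrative choices rather than part of the original answer. It assumes that a base_margin of 0 (that is, log(1)) at prediction time yields the predicted frequency per unit of exposure.

```r
library(xgboost)

# Simulated data as in the question: true frequency is 2 when x1 = x2 = 1, else 1
set.seed(1)
size <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), size, replace = TRUE),
  x2 = sample(c(0, 1), size, replace = TRUE),
  exposure = runif(size, 1, 10) * 0.3
)
d$frequency <- 2^(d$x1 == 1 & d$x2 == 1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)

X <- as.matrix(d[, c("x1", "x2")])

# Training matrix: the label is the claim count, the offset is log(exposure)
xgtrain <- xgb.DMatrix(X, label = d$claims)
setinfo(xgtrain, "base_margin", log(d$exposure))

# eta and nrounds are illustrative values, not taken from the original post
params <- list(objective = "count:poisson", eta = 0.3, max_depth = 2)
fit <- xgb.train(params = params, data = xgtrain, nrounds = 50)

# Expected claim counts: predict on a matrix that carries the exposure offset
pred_claims <- predict(fit, xgtrain)

# Frequency per unit exposure: predict with a zero offset, i.e. log(1)
xgfreq <- xgb.DMatrix(X)
setinfo(xgfreq, "base_margin", rep(0, nrow(X)))
pred_freq <- predict(fit, xgfreq)

# Mean predicted frequency for each (x1, x2) cell
print(tapply(pred_freq, list(x1 = d$x1, x2 = d$x2), mean))
```

The offset has to be supplied again at prediction time: a DMatrix without base_margin falls back to the model's base score, so its predictions are not on the per-unit-exposure scale.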
