How to specify train and test indices for xgb.cv in the R package XGBoost


Problem Description

I recently found out about the folds parameter in xgb.cv, which allows one to specify the indices of the validation set. The helper function xgb.cv.mknfold is then invoked within xgb.cv, which then takes the remaining indices for each fold to be the indices of the training set for the respective fold.

Question: Can I specify both the training and validation indices through any interface in the xgboost package?

My primary motivation is performing time-series cross-validation, and I do not want the 'non-validation' indices to be automatically assigned as the training data. An example to illustrate what I want to do:

# assume i have 100 strips of time-series data, where each strip is X_i
# validate only on 10 points after training
fold1:  train on X_1-X_10, validate on X_11-X_20
fold2:  train on X_1-X_20, validate on X_21-X_30
fold3:  train on X_1-X_30, validate on X_31-X_40
...
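
For concreteness, assuming each X_i is a single row of a 100-row data set, the index lists for this expanding-window scheme could be built along these lines (a sketch; the names train_idx and val_idx are my own):

# expanding-window folds: fold k trains on rows 1..10k and
# validates on the next 10 rows (assuming 100 rows in total)
train_idx <- lapply(1:9, function(k) 1:(10 * k))
val_idx   <- lapply(1:9, function(k) (10 * k + 1):(10 * k + 10))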

Currently, using the folds parameter would force me to use the remaining examples as the validation set, which greatly increases the variance of the error estimate, since the remaining data greatly outnumber the training data and may have a very different distribution from the training data, especially for the earlier folds. Here's what I mean:

fold1:  train on X_1-X_10, validate on X_11-X_100 # huge error
...

I'm open to solutions from other packages if they are convenient (i.e. wouldn't require me to pry open the source code) and do not nullify the efficiencies of the original xgboost implementation.

Answer

I think the bottom part of the question is the wrong way round; it should probably say:

force me to use the remaining examples as the training set

It also seems that the helper function xgb.cv.mknfold mentioned above is not around anymore. Note that my version of xgboost is 0.71.2.

However, it does seem that this could be achieved fairly straightforwardly with a small modification of xgb.cv, e.g. something like:

xgb.cv_new <- function(params = list(), data, nrounds, nfold, label = NULL, 
          missing = NA, prediction = FALSE, showsd = TRUE, metrics = list(), 
          obj = NULL, feval = NULL, stratified = TRUE, folds = NULL, folds_train = NULL, 
          verbose = TRUE, print_every_n = 1L, early_stopping_rounds = NULL, 
          maximize = NULL, callbacks = list(), ...) {
  check.deprecation(...)
  params <- check.booster.params(params, ...)
  for (m in metrics) params <- c(params, list(eval_metric = m))
  check.custom.obj()
  check.custom.eval()
  if ((inherits(data, "xgb.DMatrix") && is.null(getinfo(data, 
                                                        "label"))) || (!inherits(data, "xgb.DMatrix") && is.null(label))) 
    stop("Labels must be provided for CV either through xgb.DMatrix, or through 'label=' when 'data' is matrix")
  if (!is.null(folds)) {
    if (!is.list(folds) || length(folds) < 2) 
      stop("'folds' must be a list with 2 or more elements that are vectors of indices for each CV-fold")
    nfold <- length(folds)
  }
  else {
    if (nfold <= 1) 
      stop("'nfold' must be > 1")
    folds <- generate.cv.folds(nfold, nrow(data), stratified, 
                               label, params)
  }
  params <- c(params, list(silent = 1))
  print_every_n <- max(as.integer(print_every_n), 1L)
  if (!has.callbacks(callbacks, "cb.print.evaluation") && verbose) {
    callbacks <- add.cb(callbacks, cb.print.evaluation(print_every_n, 
                                                       showsd = showsd))
  }
  evaluation_log <- list()
  if (!has.callbacks(callbacks, "cb.evaluation.log")) {
    callbacks <- add.cb(callbacks, cb.evaluation.log())
  }
  stop_condition <- FALSE
  if (!is.null(early_stopping_rounds) && !has.callbacks(callbacks, 
                                                        "cb.early.stop")) {
    callbacks <- add.cb(callbacks, cb.early.stop(early_stopping_rounds, 
                                                 maximize = maximize, verbose = verbose))
  }
  if (prediction && !has.callbacks(callbacks, "cb.cv.predict")) {
    callbacks <- add.cb(callbacks, cb.cv.predict(save_models = FALSE))
  }
  cb <- categorize.callbacks(callbacks)
  dall <- xgb.get.DMatrix(data, label, missing)
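  # For each fold k, 'folds[[k]]' supplies the validation rows; the
  # training rows are the complement of that fold by default or, if
  # supplied, the k-th element of the new 'folds_train' argument.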
  bst_folds <- lapply(seq_along(folds), function(k) {
    dtest <- slice(dall, folds[[k]])
    if (is.null(folds_train))
      dtrain <- slice(dall, unlist(folds[-k]))
    else
      dtrain <- slice(dall, folds_train[[k]])
    handle <- xgb.Booster.handle(params, list(dtrain, dtest))
    list(dtrain = dtrain, bst = handle, watchlist = list(train = dtrain, 
                                                         test = dtest), index = folds[[k]])
  })
  rm(dall)
  basket <- list()
  num_class <- max(as.numeric(NVL(params[["num_class"]], 1)), 
                   1)
  num_parallel_tree <- max(as.numeric(NVL(params[["num_parallel_tree"]], 
                                          1)), 1)
  begin_iteration <- 1
  end_iteration <- nrounds
  for (iteration in begin_iteration:end_iteration) {
    for (f in cb$pre_iter) f()
    msg <- lapply(bst_folds, function(fd) {
      xgb.iter.update(fd$bst, fd$dtrain, iteration - 1, 
                      obj)
      xgb.iter.eval(fd$bst, fd$watchlist, iteration - 1, 
                    feval)
    })
    msg <- simplify2array(msg)
    bst_evaluation <- rowMeans(msg)
    bst_evaluation_err <- sqrt(rowMeans(msg^2) - bst_evaluation^2)
    for (f in cb$post_iter) f()
    if (stop_condition) 
      break
  }
  for (f in cb$finalize) f(finalize = TRUE)
  ret <- list(call = match.call(), params = params, callbacks = callbacks, 
              evaluation_log = evaluation_log, niter = end_iteration, 
              nfeatures = ncol(data), folds = folds)
  ret <- c(ret, basket)
  class(ret) <- "xgb.cv.synchronous"
  invisible(ret)
}

I have just added an optional argument folds_train = NULL and used that later on inside the function in this way (see above):

if (is.null(folds_train))
  dtrain <- slice(dall, unlist(folds[-k]))
else
  dtrain <- slice(dall, folds_train[[k]])

Then you can use the new version of the function, e.g. like below:

# save original version
orig <- xgboost::xgb.cv

# devtools::install_github("miraisolutions/godmode")
godmode:::assignAnywhere("xgb.cv", xgb.cv_new)

# now you can use (call) xgb.cv with the additional argument

# once you are done, or if you want to switch back to the original version
# (if you restart R you will also be back to the original version):
godmode:::assignAnywhere("xgb.cv", orig)
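
For example, with the expanding-window index lists sketched earlier (val_idx and train_idx are the hypothetical lists from above; X and y stand for your feature matrix and label vector):

# hypothetical call: 'folds' takes the validation indices, the new
# 'folds_train' argument takes the matching training indices
cv_res <- xgb.cv(params = list(objective = "reg:linear", eta = 0.1),
                 data = X, label = y, nrounds = 50,
                 folds = val_idx, folds_train = train_idx)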

So now you should be able to call the function with the extra argument, providing the additional indices for the training data.

Note that I have not had time to test this.
