How to specify train and test indices for xgb.cv in R package XGBoost

Problem Description

I recently found out about the folds parameter in xgb.cv, which allows one to specify the indices of the validation set. The helper function xgb.cv.mknfold is then invoked within xgb.cv, and it takes the remaining indices for each fold as the indices of that fold's training set.

Question: Can I specify both the training and validation indices via any of the interfaces in the xgboost package?

My primary motivation is performing time-series cross validation, and I do not want the 'non-validation' indices to be automatically assigned as the training data. An example to illustrate what I want to do:

# assume i have 100 strips of time-series data, where each strip is X_i
# validate only on 10 points after training
fold1:  train on X_1-X_10, validate on X_11-X_20
fold2:  train on X_1-X_20, validate on X_21-X_30
fold3:  train on X_1-X_30, validate on X_31-X_40
...
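
For concreteness, these index lists can be built programmatically, e.g. (a minimal sketch, assuming for simplicity that each strip X_i corresponds to a single row i of the data):

# expanding-window indices: train on rows 1..10k, validate on rows 10k+1..10k+10
train_idx <- lapply(1:9, function(k) seq_len(10 * k))
valid_idx <- lapply(1:9, function(k) 10 * k + seq_len(10))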

Currently, using the folds parameter would force me to use the remaining examples as the validation set, which greatly increases the variance of the error estimate since the remaining data greatly outnumber the training data and may have a very different distribution from the training data especially for the earlier folds. Here's what I mean:

fold1:  train on X_1-X_10, validate on X_11-X_100 # huge error
...

I'm open to solutions from other packages if they are convenient (i.e. would not require me to pry open the source code) and do not nullify the efficiencies of the original xgboost implementation.

Recommended Answer

I think the bottom part of the question is the wrong way round; it should probably say:


force me to use the remaining examples as the training set

It also seems that the mentioned helper function xgb.cv.mknfold is not around anymore. Note my version of xgboost is 0.71.2.
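(You can verify this yourself: packageVersion("xgboost") reports the installed version, and getAnywhere("xgb.cv.mknfold") shows whether the helper is still present.)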

However, it does seem that this could be achieved fairly straightforwardly with a small modification of xgb.cv, e.g. something like:

xgb.cv_new <- function(params = list(), data, nrounds, nfold, label = NULL, 
          missing = NA, prediction = FALSE, showsd = TRUE, metrics = list(), 
          obj = NULL, feval = NULL, stratified = TRUE, folds = NULL, folds_train = NULL, 
          verbose = TRUE, print_every_n = 1L, early_stopping_rounds = NULL, 
          maximize = NULL, callbacks = list(), ...) {
  check.deprecation(...)
  params <- check.booster.params(params, ...)
  for (m in metrics) params <- c(params, list(eval_metric = m))
  check.custom.obj()
  check.custom.eval()
  if ((inherits(data, "xgb.DMatrix") && is.null(getinfo(data, 
                                                        "label"))) || (!inherits(data, "xgb.DMatrix") && is.null(label))) 
    stop("Labels must be provided for CV either through xgb.DMatrix, or through 'label=' when 'data' is matrix")
  if (!is.null(folds)) {
    if (!is.list(folds) || length(folds) < 2) 
      stop("'folds' must be a list with 2 or more elements that are vectors of indices for each CV-fold")
    nfold <- length(folds)
  }
  else {
    if (nfold <= 1) 
      stop("'nfold' must be > 1")
    folds <- generate.cv.folds(nfold, nrow(data), stratified, 
                               label, params)
  }
  params <- c(params, list(silent = 1))
  print_every_n <- max(as.integer(print_every_n), 1L)
  if (!has.callbacks(callbacks, "cb.print.evaluation") && verbose) {
    callbacks <- add.cb(callbacks, cb.print.evaluation(print_every_n, 
                                                       showsd = showsd))
  }
  evaluation_log <- list()
  if (!has.callbacks(callbacks, "cb.evaluation.log")) {
    callbacks <- add.cb(callbacks, cb.evaluation.log())
  }
  stop_condition <- FALSE
  if (!is.null(early_stopping_rounds) && !has.callbacks(callbacks, 
                                                        "cb.early.stop")) {
    callbacks <- add.cb(callbacks, cb.early.stop(early_stopping_rounds, 
                                                 maximize = maximize, verbose = verbose))
  }
  if (prediction && !has.callbacks(callbacks, "cb.cv.predict")) {
    callbacks <- add.cb(callbacks, cb.cv.predict(save_models = FALSE))
  }
  cb <- categorize.callbacks(callbacks)
  dall <- xgb.get.DMatrix(data, label, missing)
  bst_folds <- lapply(seq_along(folds), function(k) {
    dtest <- slice(dall, folds[[k]])
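    # modification: use caller-supplied training indices for fold k when
    # 'folds_train' is given; otherwise fall back to the default behaviour
    # of training on the complement of the k-th validation fold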
    if (is.null(folds_train))
      dtrain <- slice(dall, unlist(folds[-k]))
    else
      dtrain <- slice(dall, folds_train[[k]])
    handle <- xgb.Booster.handle(params, list(dtrain, dtest))
    list(dtrain = dtrain, bst = handle, watchlist = list(train = dtrain, 
                                                         test = dtest), index = folds[[k]])
  })
  rm(dall)
  basket <- list()
  num_class <- max(as.numeric(NVL(params[["num_class"]], 1)), 
                   1)
  num_parallel_tree <- max(as.numeric(NVL(params[["num_parallel_tree"]], 
                                          1)), 1)
  begin_iteration <- 1
  end_iteration <- nrounds
  for (iteration in begin_iteration:end_iteration) {
    for (f in cb$pre_iter) f()
    msg <- lapply(bst_folds, function(fd) {
      xgb.iter.update(fd$bst, fd$dtrain, iteration - 1, 
                      obj)
      xgb.iter.eval(fd$bst, fd$watchlist, iteration - 1, 
                    feval)
    })
    msg <- simplify2array(msg)
    bst_evaluation <- rowMeans(msg)
    bst_evaluation_err <- sqrt(rowMeans(msg^2) - bst_evaluation^2)
    for (f in cb$post_iter) f()
    if (stop_condition) 
      break
  }
  for (f in cb$finalize) f(finalize = TRUE)
  ret <- list(call = match.call(), params = params, callbacks = callbacks, 
              evaluation_log = evaluation_log, niter = end_iteration, 
              nfeatures = ncol(data), folds = folds)
  ret <- c(ret, basket)
  class(ret) <- "xgb.cv.synchronous"
  invisible(ret)
}

I have just added an optional argument folds_train = NULL and used that later on inside the function in this way (see above):

if (is.null(folds_train))
  dtrain <- slice(dall, unlist(folds[-k]))
else
  dtrain <- slice(dall, folds_train[[k]])

Then you can use the new version of the function, e.g. like below:

# save original version
orig <- xgboost::xgb.cv

# devtools::install_github("miraisolutions/godmode")
godmode:::assignAnywhere("xgb.cv", xgb.cv_new)

# now you can use (call) xgb.cv with the additional argument

# once you are done, you may want to switch back to the original version
# (if you restart R you will also be back to the original version):
godmode:::assignAnywhere("xgb.cv", orig)

So now you should be able to call the function with the extra argument, providing the additional indices for the training data.
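
For illustration, a full call might then look like this (a sketch under the assumptions above: the patched version has been assigned in place of xgb.cv, and the data, labels and parameters are placeholders standing in for 100 time-ordered observations):

library(xgboost)

# placeholder data: 100 time-ordered rows with a binary label
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)
y <- rbinom(100, 1, 0.5)

# expanding-window folds: train on rows 1..10k, validate on rows 10k+1..10k+10
folds_train <- lapply(1:9, function(k) seq_len(10 * k))
folds_valid <- lapply(1:9, function(k) 10 * k + seq_len(10))

cv <- xgb.cv(params = list(objective = "binary:logistic", eta = 0.1),
             data = X, label = y, nrounds = 20,
             folds = folds_valid, folds_train = folds_train,
             verbose = TRUE)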

Note that I have not had time to test this.
