xgboost in R: how does xgb.cv pass the optimal parameters into xgb.train

Question

I've been exploring the xgboost package in R and went through several demos as well as tutorials, but this still confuses me: after using xgb.cv to do cross-validation, how do the optimal parameters get passed to xgb.train? Or should I calculate the ideal parameters (such as nround, max.depth) based on the output of xgb.cv?

param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              "num_class" = 12)
cv.nround <- 11
cv.nfold <- 5
mdcv <- xgb.cv(data = dtrain, params = param, nthread = 6,
               nfold = cv.nfold, nrounds = cv.nround, verbose = T)

md <- xgb.train(data = dtrain, params = param, nrounds = 80,
                watchlist = list(train = dtrain, test = dtest), nthread = 6)

Answer

It looks like you misunderstood xgb.cv; it is not a parameter-searching function. It does k-fold cross-validation, nothing more.

In your code, it does not change the value of param.

There are a few ways to find the best parameters for XGBoost in R. Here are two of them.

(1) Use the mlr package: http://mlr-org.github.io/mlr-tutorial/release/html/

There is an XGBoost + mlr example code in Kaggle's Prudential challenge,

But that code is for regression, not classification. As far as I know, there is no mlogloss metric in the mlr package yet, so you would have to code the mlogloss measure yourself. CMIIW (correct me if I'm wrong).
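
For orientation only, a minimal mlr-based random-search sketch could look roughly like the following. It assumes a data frame train_df with a factor target column label (both hypothetical names), that the classif.xgboost learner is available in your mlr version, and that your mlr ships a logloss measure; if it does not, you would define one with makeMeasure(), as noted above.

library(mlr)

# hypothetical training data: data frame train_df with factor target "label"
task <- makeClassifTask(data = train_df, target = "label")

lrn <- makeLearner("classif.xgboost", predict.type = "prob",
                   par.vals = list(nrounds = 100))

# search space for the xgboost hyperparameters
ps <- makeParamSet(
  makeIntegerParam("max_depth", lower = 3, upper = 10),
  makeNumericParam("eta", lower = 0.01, upper = 0.3),
  makeNumericParam("subsample", lower = 0.5, upper = 1),
  makeNumericParam("colsample_bytree", lower = 0.5, upper = 1)
)

ctrl  <- makeTuneControlRandom(maxit = 50)   # 50 random configurations
rdesc <- makeResampleDesc("CV", iters = 5)   # 5-fold cross-validation

res <- tuneParams(lrn, task = task, resampling = rdesc,
                  par.set = ps, control = ctrl,
                  measures = list(logloss))
res$x   # best hyperparameters found
res$y   # their cross-validated logloss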

(2) The second method is to set the parameters manually, run the cross-validation, and repeat. For example:

param <- list(objective = "multi:softprob",
              eval_metric = "mlogloss",
              num_class = 12,
              max_depth = 8,
              eta = 0.05,
              gamma = 0.01,
              subsample = 0.9,
              colsample_bytree = 0.8,
              min_child_weight = 4,
              max_delta_step = 1
              )
cv.nround = 1000
cv.nfold = 5
mdcv <- xgb.cv(data=dtrain, params = param, nthread=6, 
                nfold=cv.nfold, nrounds=cv.nround,
                verbose = T)

Then you find the best (minimum) mlogloss:

min_logloss = min(mdcv[, test.mlogloss.mean])
min_logloss_index = which.min(mdcv[, test.mlogloss.mean])
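
Note that this indexing assumes the older xgboost API, where xgb.cv returned the evaluation history as a data.table. In more recent releases the history is stored in the evaluation_log element and the column names use underscores, so the equivalent would be roughly:

min_logloss = min(mdcv$evaluation_log$test_mlogloss_mean)
min_logloss_index = which.min(mdcv$evaluation_log$test_mlogloss_mean)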

min_logloss is the minimum value of mlogloss, while min_logloss_index is the index (round) at which it occurs.

You must repeat the process above several times, changing the parameters manually each time (mlr does the repetition for you), until you finally get the global minimum min_logloss.

Note: you can do it in a loop of 100 or 200 iterations, setting the parameter values randomly in each iteration. In that case, you must save the best [parameters_list, min_logloss, min_logloss_index] in variables or in a file.

Note: it is better to set the random seed with set.seed() for reproducible results. Different random seeds yield different results, so you must also save [parameters_list, min_logloss, min_logloss_index, seednumber] in variables or a file.

Say that you finally get 3 results from 3 iterations/repeats:

min_logloss = 2.1457, min_logloss_index = 840
min_logloss = 2.2293, min_logloss_index = 920
min_logloss = 1.9745, min_logloss_index = 780

Then you must use the third parameter set (it has the global minimum min_logloss of 1.9745). Your best index (nrounds) is 780.

Once you have the best parameters, use them for training:

# best_param is global best param with minimum min_logloss
# best_min_logloss_index is the global minimum logloss index
nround = 780
md <- xgb.train(data=dtrain, params=best_param, nrounds=nround, nthread=6)

I don't think you need a watchlist for the training, because you have already done the cross-validation. But if you still want to use a watchlist, that is fine.
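
As a usage sketch (assuming dtest is an xgb.DMatrix holding the test features), keep in mind that the multi:softprob objective returns the class probabilities flattened into one long vector, so you reshape it into an n-by-num_class matrix before reading off the predicted labels:

pred <- predict(md, dtest)
pred_matrix <- matrix(pred, ncol = 12, byrow = TRUE)  # one row per case, one column per class
pred_label  <- max.col(pred_matrix) - 1               # xgboost class labels are 0-based, here 0..11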

Even better, you can use early stopping in xgb.cv.

mdcv <- xgb.cv(data=dtrain, params=param, nthread=6, 
                nfold=cv.nfold, nrounds=cv.nround,
                verbose = T, early.stop.round=8, maximize=FALSE)

With this code, xgb.cv will stop when the mlogloss value has not decreased for 8 rounds, which saves time. You must set maximize to FALSE, because you want the minimum mlogloss.
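
(If you are on a recent xgboost release, this argument has been renamed to early_stopping_rounds, so the equivalent call would look roughly like this:)

mdcv <- xgb.cv(data = dtrain, params = param, nthread = 6,
               nfold = cv.nfold, nrounds = cv.nround,
               verbose = T, early_stopping_rounds = 8, maximize = FALSE)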

Here is an example code with a 100-iteration loop and randomly chosen parameters.

# track the best configuration found across all iterations
best_param = list()
best_seednumber = 1234
best_logloss = Inf
best_logloss_index = 0

for (iter in 1:100) {
    # draw a random candidate parameter set
    param <- list(objective = "multi:softprob",
                  eval_metric = "mlogloss",
                  num_class = 12,
                  max_depth = sample(6:10, 1),
                  eta = runif(1, .01, .3),
                  gamma = runif(1, 0.0, 0.2),
                  subsample = runif(1, .6, .9),
                  colsample_bytree = runif(1, .5, .8),
                  min_child_weight = sample(1:40, 1),
                  max_delta_step = sample(1:10, 1)
                  )
    cv.nround = 1000
    cv.nfold = 5
    # remember the seed so the winning run can be reproduced later
    seed.number = sample.int(10000, 1)[[1]]
    set.seed(seed.number)
    mdcv <- xgb.cv(data=dtrain, params = param, nthread=6, 
                    nfold=cv.nfold, nrounds=cv.nround,
                    verbose = T, early.stop.round=8, maximize=FALSE)

    min_logloss = min(mdcv[, test.mlogloss.mean])
    min_logloss_index = which.min(mdcv[, test.mlogloss.mean])

    # keep this candidate if it beats the best so far
    if (min_logloss < best_logloss) {
        best_logloss = min_logloss
        best_logloss_index = min_logloss_index
        best_seednumber = seed.number
        best_param = param
    }
}

# retrain with the winning seed, parameters, and round count
nround = best_logloss_index
set.seed(best_seednumber)
md <- xgb.train(data=dtrain, params=best_param, nrounds=nround, nthread=6)

With this code, you run cross-validation 100 times, each time with random parameters. At the end you have the parameter set from the iteration with the minimum min_logloss.

Increase the value of early.stop.round if you find that it is too small (stopping too early). You also need to adjust the ranges of the random parameter values to your data's characteristics.

And for 100 or 200 iterations, I think you will want to change verbose to FALSE.

Side note: that is an example of a random-search method; you can adjust it, e.g. use Bayesian optimization for a better method. If you have the Python version of XGBoost, there is a good hyperparameter script for XGBoost, https://github.com/mpearmain/BayesBoost, which searches for the best parameter set using Bayesian optimization.
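
If you want to stay in R, one option worth trying is the rBayesianOptimization package. The sketch below is only illustrative, not the method from the linked script: it assumes rBayesianOptimization is installed and uses the newer xgb.cv argument and return names mentioned earlier.

library(rBayesianOptimization)

# objective for the optimizer: cross-validated mlogloss for one candidate
xgb_cv_bayes <- function(max_depth, eta, subsample, colsample_bytree) {
  param <- list(objective = "multi:softprob",
                eval_metric = "mlogloss",
                num_class = 12,
                max_depth = max_depth,
                eta = eta,
                subsample = subsample,
                colsample_bytree = colsample_bytree)
  cv <- xgb.cv(data = dtrain, params = param, nthread = 6,
               nfold = 5, nrounds = 1000, verbose = FALSE,
               early_stopping_rounds = 8, maximize = FALSE)
  # BayesianOptimization maximizes Score, so return the negative logloss
  list(Score = -min(cv$evaluation_log$test_mlogloss_mean), Pred = 0)
}

opt <- BayesianOptimization(xgb_cv_bayes,
                            bounds = list(max_depth = c(3L, 10L),
                                          eta = c(0.01, 0.3),
                                          subsample = c(0.6, 0.9),
                                          colsample_bytree = c(0.5, 0.8)),
                            init_points = 10, n_iter = 30)
opt$Best_Par   # best hyperparameters found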

I also want to add a third method, posted by Davut Polat, a Kaggle master, in the Kaggle forum.

If you know Python and sklearn, you can also use GridSearchCV along with xgboost.XGBClassifier or xgboost.XGBRegressor.
