Parallel processing with xgboost and caret

Problem description

I want to parallelize the model fitting process for xgboost while using caret. From what I have seen in xgboost's documentation, the nthread parameter controls the number of threads to use while fitting the models, in the sense of building the trees in parallel. Caret's train function performs parallelization in the sense of, for example, running one process for each iteration in a k-fold CV. Is this understanding correct, and if so, is it better to:

  1. Register the number of cores (for example, with the doMC package and the registerDoMC function), set nthread=1 via caret's train function so it passes that parameter to xgboost, set allowParallel=TRUE in trainControl, and let caret handle the parallelization for the cross-validation; or
  2. Disable caret's parallelization (allowParallel=FALSE and no parallel back-end registered) and set nthread to the number of physical cores, so the parallelization is contained exclusively within xgboost (both setups are sketched right after this list).
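
For concreteness, here is a minimal sketch of the two setups. It assumes a 4-core machine and a data frame dat with a factor outcome Class; those names and the 5-fold CV are placeholders, not part of the original question.

library(caret)
library(doMC)

## Setup 1: caret-level parallelism.
## Register workers for the resampling loop and keep xgboost single-threaded.
registerDoMC(cores = 4)
fit_caret_par <- train(Class ~ ., data = dat, method = "xgbTree",
                       trControl = trainControl(method = "cv", number = 5,
                                                allowParallel = TRUE),
                       nthread = 1)  # passed through to xgboost

## Setup 2: xgboost-level parallelism.
## No caret parallelism; xgboost multithreads the tree building.
fit_xgb_par <- train(Class ~ ., data = dat, method = "xgbTree",
                     trControl = trainControl(method = "cv", number = 5,
                                              allowParallel = FALSE),
                     nthread = 4)  # one thread per physical core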

Or is there no "better" way to perform the parallelization?

I ran the code suggested by @topepo, with tuneLength = 10 and search = "random", specifying nthread = 1 in the last line (otherwise, as I understand it, xgboost will use multithreading).
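
If I read the answer's code correctly, that change amounts to replacing the final timing call with the following (the exact call is my reconstruction, not from the original post):

mc_par <- system.time(foo(nthread = 1))  # caret workers only; xgboost single-threaded

These are the results I got: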

> xgb_par[3]
elapsed 
283.691 
> just_seq[3]
elapsed 
276.704 
> mc_par[3]
elapsed 
89.074 
> just_seq[3]/mc_par[3]
elapsed 
3.106451 
> just_seq[3]/xgb_par[3]
elapsed 
0.9753711 
> xgb_par[3]/mc_par[3]
elapsed 
3.184891

In the end, it turned out that, both for my data and for this test case, letting caret handle the parallelization was the better choice in terms of runtime.

Accepted answer

It is not simple to project what the best strategy would be. My (biased) thought is that you should parallelize the process that takes the longest. Here, that would be the resampling loop, since an open thread/worker would invoke the model many times. The opposite approach of parallelizing the model fit would start and stop workers repeatedly, which theoretically slows things down. Your mileage may vary.

I don't have OpenMP installed but there is code below to test (if you could report your results, that would be helpful).

library(caret)
library(plyr)
library(xgboost)
library(doMC)

## Fit an xgboost model with a 50-candidate random tuning search;
## extra arguments (e.g. nthread) are passed through to xgboost.
foo <- function(...) {
  set.seed(2)
  mod <- train(Class ~ ., data = dat, 
               method = "xgbTree", tuneLength = 50,
               ..., trControl = trainControl(search = "random"))
  invisible(mod)
}

## Simulated two-class data
set.seed(1)
dat <- twoClassSim(1000)

## Baseline: no parallelism anywhere
just_seq <- system.time(foo())

## xgboost-level parallelism (I don't have OpenMP installed)
xgb_par <- system.time(foo(nthread = 5))

## caret-level parallelism: 5 workers for the resampling loop
registerDoMC(cores = 5)
mc_par <- system.time(foo())

My results (without OpenMP)

> just_seq[3]
elapsed 
326.422 
> xgb_par[3]
elapsed 
319.862 
> mc_par[3]
elapsed 
102.329 
> 
> ## Speedups
> xgb_par[3]/mc_par[3]
elapsed 
3.12582 
> just_seq[3]/mc_par[3]
elapsed 
3.189927 
> just_seq[3]/xgb_par[3]
elapsed 
1.020509 
