Parallel processing with xgboost and caret
Question
I want to parallelize the model fitting process for xgboost while using caret. From what I have seen in xgboost's documentation, the nthread parameter controls the number of threads to use while fitting the models, in the sense of building the trees in a parallel way. Caret's train function will perform parallelization in the sense of, for example, running a process for each iteration in a k-fold CV. Is this understanding correct, and if yes, is it better to:
- Register the number of cores (for example, with the doMC package and the registerDoMC function), set nthread=1 via caret's train function so it passes that parameter to xgboost, set allowParallel=TRUE in trainControl, and let caret handle the parallelization for the cross-validation; or
- Disable caret parallelization (allowParallel=FALSE and no parallel back-end registered) and set nthread to the number of physical cores, so the parallelization is contained exclusively within xgboost.
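The two setups above can be sketched roughly as follows. This is an illustration only: the dat data frame with a Class outcome, the 5-fold CV, and the core counts are placeholder assumptions, not taken from the original post.

```r
library(caret)
library(doMC)

## Option 1: caret parallelizes the resampling loop;
## xgboost is kept single-threaded via nthread = 1
registerDoMC(cores = 4)                 # register a parallel back-end
ctrl_caret <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
fit_caret  <- train(Class ~ ., data = dat, method = "xgbTree",
                    trControl = ctrl_caret, nthread = 1)

## Option 2: no parallel back-end registered; caret runs sequentially
## and xgboost multithreads the tree building internally
ctrl_xgb <- trainControl(method = "cv", number = 5, allowParallel = FALSE)
fit_xgb  <- train(Class ~ ., data = dat, method = "xgbTree",
                  trControl = ctrl_xgb, nthread = 4)
```

In both cases any argument that train does not recognize (such as nthread) is passed through to the underlying xgboost fit.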
Or is there no "better" way to perform the parallelization?
I ran the code suggested by @topepo, with tuneLength = 10 and search = "random", and specified nthread = 1 on the last line (otherwise I understand that xgboost will use multithreading). Here are the results I got:
> xgb_par[3]
 elapsed
 283.691
> just_seq[3]
 elapsed
 276.704
> mc_par[3]
 elapsed
  89.074
> just_seq[3]/mc_par[3]
 elapsed
3.106451
> just_seq[3]/xgb_par[3]
  elapsed
0.9753711
> xgb_par[3]/mc_par[3]
 elapsed
3.184891
In the end, it turned out that, both for my data and for this test case, letting caret handle the parallelization was the better choice in terms of runtime.
Answer
It is not simple to predict what the best strategy would be. My (biased) thought is that you should parallelize the process that takes the longest. Here, that would be the resampling loop, since an open thread/worker would invoke the model many times. The opposite approach of parallelizing the model fit will start and stop workers repeatedly, which theoretically slows things down. Your mileage may vary.
I don't have OpenMP installed but there is code below to test (if you could report your results, that would be helpful).
library(caret)
library(plyr)
library(xgboost)
library(doMC)

foo <- function(...) {
  set.seed(2)
  mod <- train(Class ~ ., data = dat,
               method = "xgbTree", tuneLength = 50,
               ..., trControl = trainControl(search = "random"))
  invisible(mod)
}

set.seed(1)
dat <- twoClassSim(1000)

just_seq <- system.time(foo())

## I don't have OpenMP installed
xgb_par <- system.time(foo(nthread = 5))

registerDoMC(cores = 5)
mc_par <- system.time(foo())
My results (without OpenMP)
> just_seq[3]
elapsed
326.422
> xgb_par[3]
elapsed
319.862
> mc_par[3]
elapsed
102.329
>
> ## Speedups
> xgb_par[3]/mc_par[3]
elapsed
3.12582
> just_seq[3]/mc_par[3]
elapsed
3.189927
> just_seq[3]/xgb_par[3]
elapsed
1.020509