Machine learning with caret: How to specify a timeout?
Is it possible to specify a timeout when training a model in R using train() from the caret library?
If not, does an R construct exist that wraps the code and can be terminated after a certain amount of time?
caret options are configured through the trainControl() function, which has no parameter for specifying a timeout period.
The two trainControl() settings with the greatest impact on runtime performance are method= and number=. The default method in caret is boot, or bootstrapping. The default number is 25, unless a cross-validation method such as method="cv" is chosen, in which case it is 10.
Therefore, a randomForest run with caret under the default settings will fit the model on 25 bootstrap samples for each candidate tuning value, a very slow process, especially if run on a single processor thread.
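As a quick check of these defaults, the values can be read directly from the list that trainControl() returns. This is a minimal sketch, assuming a reasonably recent caret release; the variable names defaultControl and cvControl are illustrative and not part of the original answer.

```r
library(caret)

# Defaults: bootstrap resampling with 25 resamples
defaultControl <- trainControl()
defaultControl$method   # "boot"
defaultControl$number   # 25

# Choosing k-fold cross validation changes the default number of resamples
cvControl <- trainControl(method = "cv")
cvControl$number        # 10
```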
Forcing a timeout
An R expression can be given a timeout period via the withTimeout() function from the R.utils package.
For example, we'll run a random forest via caret with the mtcars data set, and execute 500 iterations of bootstrap sampling to get train() to run longer than 15 seconds. We will use withTimeout() to stop processing after 15 seconds; by default, the timeout= value is applied to both CPU time and elapsed time.
data(mtcars)
library(randomForest)
library(R.utils)
library(caret)
fitControl <- trainControl(method = "boot",
                           number = 500,
                           allowParallel = FALSE)
withTimeout(
  theModel <- train(mpg ~ ., data = mtcars, method = "rf", trControl = fitControl),
  timeout = 15)
...and the first part of the output:
> withTimeout(
+ theModel <- train(mpg ~ .,data=mtcars,method="rf",trControl=fitControl)
+ ,timeout=15)
[2018-05-19 07:32:37] TimeoutException: task 2 failed - "reached elapsed time limit" [cpu=15s, elapsed=15s]
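Since withTimeout() signals a TimeoutException by default, a script that should keep running after the timeout needs to catch that condition. The following is a minimal sketch of one way to do so with tryCatch(). The function slowTask() is a hypothetical stand-in for a long train() call, used here so the example runs in about a second without fitting a model. Time limits in R are checked only at points where a user interrupt could occur, so a call that spends a long time in compiled code may overrun the limit.

```r
library(R.utils)

# Hypothetical stand-in for a long-running train() call
slowTask <- function(seconds) {
  for (i in seq_len(seconds * 10)) Sys.sleep(0.1)
  "finished"
}

# Catch the TimeoutException so the script can continue;
# result is NULL when the time limit is reached
result <- tryCatch(
  withTimeout(slowTask(5), timeout = 1),
  TimeoutException = function(ex) {
    message("Timed out: ", conditionMessage(ex))
    NULL
  }
)

is.null(result)
```

An alternative, if no message is needed, is to pass onTimeout = "silent" to withTimeout(), which returns NULL on timeout instead of signaling an error.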
Improving caret performance
Aside from simply timing out the caret::train() function, we can use two techniques to improve the performance of caret::train(): parallel processing and adjustments to the trainControl() parameters.
- Coding an R script to use parallel processing requires the parallel and doParallel packages, and is a multi-step process.
- Changing method="boot" to method="cv" (k-fold cross validation) and reducing number= to 3 or 5 will significantly improve the runtime performance of caret::train().
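To illustrate the second technique on its own, the sketch below fits the same random forest on mtcars twice, once with 25 bootstrap resamples and once with 5-fold cross validation, and records the elapsed time of each. The object names (bootControl, cvControl, fitBoot, fitCv) are illustrative, and the actual timings will vary by machine, so none are shown.

```r
library(caret)
library(randomForest)

data(mtcars)
set.seed(95014)

# Same model, two resampling schemes, both on a single thread
bootControl <- trainControl(method = "boot", number = 25, allowParallel = FALSE)
cvControl   <- trainControl(method = "cv", number = 5, allowParallel = FALSE)

bootTime <- system.time(
  fitBoot <- train(mpg ~ ., data = mtcars, method = "rf", trControl = bootControl)
)
cvTime <- system.time(
  fitCv <- train(mpg ~ ., data = mtcars, method = "rf", trControl = cvControl)
)

# One row per resample for the selected tuning value
nrow(fitBoot$resample)  # 25
nrow(fitCv$resample)    # 5

# Elapsed seconds for each scheme
c(boot = bootTime[["elapsed"]], cv = cvTime[["elapsed"]])
```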
Summarizing techniques I previously described in Improving Performance of Random Forest with caret::train(), the following code uses the Sonar data set to implement parallel processing with caret and randomForest.
#
# Sonar example from caret documentation
#
library(mlbench)
library(randomForest) # needed for varImpPlot
data(Sonar)
#
# review distribution of Class column
#
table(Sonar$Class)
library(caret)
set.seed(95014)
# create training & testing data sets
inTraining <- createDataPartition(Sonar$Class, p = .75, list=FALSE)
training <- Sonar[inTraining,]
testing <- Sonar[-inTraining,]
#
# Step 1: configure parallel processing
#
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
#
# Step 2: configure trainControl() object for k-fold cross validation with
# 5 folds
#
fitControl <- trainControl(method = "cv",
                           number = 5,
                           allowParallel = TRUE)
#
# Step 3: develop training model
#
system.time(fit <- train(Class ~ ., method="rf",data=Sonar,trControl = fitControl))
#
# Step 4: de-register cluster
#
stopCluster(cluster)
registerDoSEQ()
#
# Step 5: evaluate model fit
#
fit
fit$resample
confusionMatrix.train(fit)
# average OOB error from final model
mean(fit$finalModel$err.rate[,"OOB"])
plot(fit,main="Accuracy by Predictor Count")
varImpPlot(fit$finalModel,
           main="Variable Importance Plot: Random Forest")
sessionInfo()