Machine learning with caret: How to specify a timeout?


Problem description


Is it possible to specify a timeout when training a model in R using train() from the caret library? If not, does an R construct exist that wraps the code and can be terminated after a certain amount of time?

Solution

Caret options are configured with the trainControl() object. It does not have a parameter to specify a timeout period.

The two settings in trainControl() that have the greatest impact on runtime performance are method= and number=. The default method in caret is "boot" (bootstrapping), for which the default number of resampling iterations is 25; when method="cv", the default number drops to 10.

Therefore, a randomForest run with caret will conduct 25 bootstrap sampling iterations, a very slow process, especially when run on a single processor thread.
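These defaults can be checked directly on a trainControl() object (a quick sanity check; the values shown assume a recent caret release):

library(caret)
ctrl <- trainControl()               # method defaults to "boot"
ctrl$method                          # "boot"
ctrl$number                          # 25 resampling iterations
trainControl(method = "cv")$number   # 10 folds by default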

Forcing a timeout

R functions can be given a timeout period via the withTimeout() function from the R.utils package.

For example, we'll run a random forest via caret on the mtcars data set and execute 500 iterations of bootstrap sampling so that train() runs longer than 15 seconds. We will use withTimeout() to stop processing after 15 seconds (timeout= sets both the CPU and elapsed time limits).

data(mtcars)
library(randomForest)
library(R.utils)
library(caret)
# force a long-running call: 500 bootstrap iterations on a single worker thread
fitControl <- trainControl(method = "boot",
                           number = 500,
                           allowParallel = FALSE)

# interrupt train() once it exceeds the 15-second limit
withTimeout(
     theModel <- train(mpg ~ .,data=mtcars,method="rf",trControl=fitControl)
     ,timeout=15)

...and the first part of the output:

> withTimeout(
+      theModel <- train(mpg ~ .,data=mtcars,method="rf",trControl=fitControl)
+      ,timeout=15)
[2018-05-19 07:32:37] TimeoutException: task 2 failed - "reached elapsed time limit" [cpu=15s, elapsed=15s]
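
Note that when the limit is reached, withTimeout() signals a TimeoutException rather than returning a model. If the script should continue instead of stopping with an error, the condition can be caught with tryCatch(), as sketched below (R.utils also provides an onTimeout= argument that can downgrade the exception to a warning or silence it):

library(R.utils)

# minimal sketch: keep going when train() exceeds the time limit
result <- tryCatch(
    withTimeout(
        train(mpg ~ ., data = mtcars, method = "rf", trControl = fitControl),
        timeout = 15),
    TimeoutException = function(ex) {
        message("train() did not finish within 15 seconds")
        NULL  # no model is returned in this case
    })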

Improving caret performance

Aside from simply timing out the caret::train() function, we can use two techniques to improve the performance of caret::train(): parallel processing and adjustments to the trainControl() parameters.

  1. Coding an R script to use parallel processing requires the parallel and doParallel packages, and is a multi-step process (see Steps 1 and 4 in the Sonar example below).
  2. Changing method="boot" to method="cv" (k-fold cross-validation) and reducing number= to 3 or 5 will significantly improve the runtime performance of caret::train().

Summarizing techniques I previously described in Improving Performance of Random Forest with caret::train(), the following code uses the Sonar data set to implement parallel processing with caret and randomForest.

#
# Sonar example from caret documentation
#

library(mlbench)
library(randomForest) # needed for varImpPlot
data(Sonar)
#
# review distribution of Class column
# 
table(Sonar$Class)
library(caret)
set.seed(95014)

# create training & testing data sets

inTraining <- createDataPartition(Sonar$Class, p = .75, list=FALSE)
training <- Sonar[inTraining,]
testing <- Sonar[-inTraining,]

#
# Step 1: configure parallel processing
# 

library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS 
registerDoParallel(cluster)

#
# Step 2: configure trainControl() object for k-fold cross validation with
#         5 folds
#

fitControl <- trainControl(method = "cv",
                           number = 5,
                           allowParallel = TRUE)

#
# Step 3: develop training model
#

# fit on the training partition created earlier
system.time(fit <- train(Class ~ ., method = "rf", data = training, trControl = fitControl))

#
# Step 4: de-register cluster
#
stopCluster(cluster)
registerDoSEQ()
#
# Step 5: evaluate model fit 
#
fit
fit$resample
confusionMatrix(fit)   # dispatches to the train method of confusionMatrix

# average OOB error from the final model
mean(fit$finalModel$err.rate[,"OOB"])

plot(fit,main="Accuracy by Predictor Count")
varImpPlot(fit$finalModel,
           main="Variable Importance Plot: Random Forest")
sessionInfo()
