Plot learning curves with caret package and R
Question
I would like to study the optimal tradeoff between bias/variance for model tuning. I'm using caret for R which allows me to plot the performance metric (AUC, accuracy...) against the hyperparameters of the model (mtry, lambda, etc.) and automatically chooses the max. This typically returns a good model, but if I want to dig further and choose a different bias/variance tradeoff I need a learning curve, not a performance curve.
For the sake of simplicity, let's say my model is a random forest, which has just one hyperparameter, 'mtry'.
I would like to plot the learning curves of both training and test sets. Something like this:
(the red curve is the test set)
On the y axis I put an error metric (number of misclassified examples or something like that); on the x axis 'mtry' or alternatively the training set size.
Questions:
Does caret have the functionality to iteratively train models on training-set folds of different sizes? If I have to code it by hand, how can I do that?
If I want to put the hyperparameter on the x axis, I need all the models trained by caret::train, not just the final model (the one with maximum performance obtained after CV). Are these "discarded" models still available after training?
Answer
Here's my code on how I approached this issue of plotting a learning curve in R while using the caret package to train the model. I use the Motor Trend Car Road Tests data (mtcars) in R for illustrative purposes. To begin, I randomize and split the mtcars dataset into training and test sets: 21 records for training and the remaining 11 for the test set. The response feature is mpg in this example.
# load caret for createDataPartition(), train(), and postResample()
library(caret)

# set seed for reproducibility
set.seed(7)
# randomize mtcars
mtcars <- mtcars[sample(nrow(mtcars)), ]
# split the mtcars data into training and test sets
mtcarsIndex <- createDataPartition(mtcars$mpg, p = .625, list = FALSE)
mtcarsTrain <- mtcars[mtcarsIndex, ]
mtcarsTest <- mtcars[-mtcarsIndex, ]
# create an empty data frame to hold the learning-curve points
learnCurve <- data.frame(m = integer(21),
                         trainRMSE = numeric(21),
                         cvRMSE = numeric(21))
# test data response feature
testY <- mtcarsTest$mpg
# run algorithms using 10-fold cross-validation with 3 repeats
trainControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
metric <- "RMSE"
# loop over increasing training-set sizes
for (i in 3:21) {
    learnCurve$m[i] <- i
    # train the learning algorithm on the first i training records
    fit.lm <- train(mpg ~ ., data = mtcarsTrain[1:i, ], method = "lm",
                    metric = metric, preProc = c("center", "scale"),
                    trControl = trainControl)
    learnCurve$trainRMSE[i] <- fit.lm$results$RMSE
    # use the trained model to predict on the test data
    prediction <- predict(fit.lm, newdata = mtcarsTest[, -1])
    rmse <- postResample(prediction, testY)
    learnCurve$cvRMSE[i] <- rmse[1]
}
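On the second question in the post: train() keeps the resampled metric for every hyperparameter value it tried in the $results data frame of the returned object, so a metric-vs-hyperparameter curve can be drawn without retraining anything. A hedged sketch with a random forest and a small mtry grid (this assumes the randomForest package is installed for method = "rf"):

```r
# Sketch: metric vs. hyperparameter from a single train() call.
# fit.rf$results holds one row per tuning value, including the
# "discarded" settings that lost model selection.
library(caret)
set.seed(7)
fit.rf <- train(mpg ~ ., data = mtcars, method = "rf",
                tuneGrid = expand.grid(mtry = c(2, 5, 10)),
                trControl = trainControl(method = "cv", number = 5))
print(fit.rf$results[, c("mtry", "RMSE")])   # one row per mtry value
plot(fit.rf$results$mtry, fit.rf$results$RMSE, type = "o",
     xlab = "mtry", ylab = "CV RMSE")
```

Only the final refitted model is stored in fit.rf$finalModel, but the per-setting resample summaries in $results are exactly what a hyperparameter-axis plot needs.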
pdf("LinearRegressionLearningCurve.pdf", width = 7, height = 7, pointsize = 12)
# plot learning curves of training-set size vs. error measure
# for the training set and the test set
# (rows 1-2 were never filled by the loop, so skip them to avoid log(0))
filled <- learnCurve$m >= 3
plot(learnCurve$m[filled], log(learnCurve$trainRMSE[filled]), type = "o",
     col = "red", xlab = "Training set size", ylab = "log(RMSE)",
     main = "Linear Model Learning Curve")
lines(learnCurve$m[filled], log(learnCurve$cvRMSE[filled]), type = "o",
      col = "blue")
legend("topright", c("Train error", "Test error"), lty = c(1, 1),
       lwd = c(2.5, 2.5), col = c("red", "blue"))
dev.off()
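As an aside, recent caret releases ship a helper, learning_curve_dat(), that automates the loop above by retraining the model on increasing proportions of the data. The argument names below follow the current caret documentation, but check your installed version before relying on them; treat this as a sketch:

```r
# Sketch: caret's built-in learning-curve helper (recent versions only).
library(caret)
library(ggplot2)
set.seed(7)
lc <- learning_curve_dat(dat = mtcars, outcome = "mpg",
                         proportion = (2:10)/10,  # training fractions
                         test_prop = 1/4,         # held-out test share
                         method = "lm", metric = "RMSE",
                         trControl = trainControl(method = "cv", number = 5))
# lc stacks Training/Testing/Resampling rows, keyed by the Data column,
# with Training_Size and RMSE ready for plotting
ggplot(lc, aes(x = Training_Size, y = RMSE, colour = Data)) +
    geom_line() + geom_point()
```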
The output plot is as shown below:
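If caret is unavailable, the mechanics of the loop can also be reproduced with base R alone, which makes them explicit. A minimal sketch using lm() directly, with the same 21/11 split sizes as above (names like trainSet/testSet are just illustrative):

```r
# Base-R reproduction of the learning-curve loop (no caret):
# fit lm() on the first i training rows, record train and test RMSE.
set.seed(7)
mt <- mtcars[sample(nrow(mtcars)), ]
trainSet <- mt[1:21, ]
testSet  <- mt[22:32, ]
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
curve <- t(sapply(3:21, function(i) {
  fit <- lm(mpg ~ ., data = trainSet[1:i, ])
  c(m = i,
    trainRMSE = rmse(trainSet$mpg[1:i], fitted(fit)),
    testRMSE  = rmse(testSet$mpg, predict(fit, newdata = testSet)))
}))
# plot both curves against training-set size
matplot(curve[, "m"], curve[, c("trainRMSE", "testRMSE")], type = "o",
        pch = 1, col = c("red", "blue"), xlab = "Training set size",
        ylab = "RMSE")
```

Note that for very small i the fit is rank-deficient (more predictors than rows), so the train RMSE is near zero and predict() warns; that is itself the high-variance end of the curve.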