Plot learning curves with caret package and R


Question

I would like to study the optimal bias/variance tradeoff for model tuning. I'm using caret for R, which lets me plot a performance metric (AUC, accuracy, ...) against the model's hyperparameters (mtry, lambda, etc.) and automatically picks the maximum. This typically returns a good model, but if I want to dig further and choose a different bias/variance tradeoff, I need a learning curve, not a performance curve.

For the sake of simplicity, let's say my model is a random forest, which has just one hyperparameter, 'mtry'.

I would like to plot the learning curves of both the training and test sets. Something like this:

(the red curve is the test set)

On the y axis I put an error metric (number of misclassified examples or something like that); on the x axis, 'mtry' or alternatively the training set size.

Questions:

  1. Does caret have functionality to iteratively train models on training sets of different sizes? If I have to code it by hand, how can I do that?

  2. If I want to put the hyperparameter on the x axis, I need all the models trained by caret::train, not just the final model (the one with the best performance after CV). Are these "discarded" models still available after training?

Recommended answer

  1. Caret will iteratively test lots of CV models for you if you set up the resampling scheme with trainControl() and list the parameter values to try (e.g. mtry) in a tuning grid. Both are then passed to train(), through its trControl and tuneGrid arguments respectively. Which parameters can go in the grid differs by model type; for method = "rf", for example, caret tunes only mtry, while ntree is passed straight through to randomForest.

  2. Yes, the object returned by train() (called trainFit here) will contain the error rate (however you specified it) for all folds of your CV.
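The setup described in the two points above could look roughly like the following. This is a minimal sketch rather than the answerer's own code: the iris data set, the seed, the grid 1:4 and ntree = 200 are all assumptions chosen only to keep the example small, and method = "rf" needs the randomForest package installed.

```r
library(caret)  # method = "rf" also requires the randomForest package

set.seed(42)    # arbitrary seed, only for reproducibility

# Resampling scheme: 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)

# Candidate mtry values; iris has 4 predictors, so 1..4 are the valid choices
grid <- expand.grid(mtry = 1:4)

trainFit <- train(Species ~ ., data = iris,
                  method    = "rf",
                  trControl = ctrl,
                  tuneGrid  = grid,
                  ntree     = 200)  # passed through to randomForest(), not tuned

print(trainFit)          # cross-validated Accuracy and Kappa for every mtry
trainFit$results         # the same numbers as a data frame, one row per mtry
trainFit$bestTune        # the mtry value that train() selected
```

As far as I know, only the model refitted with the selected mtry is kept (in trainFit$finalModel); for the other grid values the performance numbers survive but the fitted model objects do not.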

So you could specify, e.g., 10-fold CV with a grid of 10 mtry values, which would be 100 model fits. You might want to go get a cup of tea or possibly lunch.
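To see the "folds times grid values" count directly, and to get an error-versus-mtry curve, one option is to ask caret to keep the per-fold results for every grid value with returnResamp = "all". The sketch below again uses iris, 5 folds and 4 mtry values as illustrative assumptions, so it yields 5 x 4 = 20 resampled fits rather than 100.

```r
library(caret)  # method = "rf" also requires the randomForest package

set.seed(42)

ctrl <- trainControl(method = "cv", number = 5,
                     returnResamp = "all")  # keep per-fold results for every mtry

fit <- train(Species ~ ., data = iris, method = "rf",
             trControl = ctrl,
             tuneGrid  = expand.grid(mtry = 1:4),
             ntree     = 200)

# One row per (mtry, fold) combination: 4 grid values x 5 folds = 20 rows
per_fold <- fit$resample
nrow(per_fold)

# One row per mtry, averaged over folds; convert accuracy to an error rate
cv_summary <- fit$results
cv_summary$Error <- 1 - cv_summary$Accuracy

plot(cv_summary$mtry, cv_summary$Error, type = "b",
     xlab = "mtry", ylab = "Cross-validated misclassification rate")
```

This gives only the held-out (CV) curve. A training-set curve for each mtry is not stored by train(), so it would have to be computed by refitting the model for each grid value.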

If this sounds complicated, there is a very good example here; caret is one of the best-documented packages around.
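For the other x axis the question mentions, training set size, the curve can be coded by hand. The following is one possible sketch, not a caret feature: it holds out a fixed test set, fits a random forest on progressively larger random subsets of the training data, and records both error rates. The data set (iris), the 70/30 split, the size grid, mtry = 2 and ntree = 200 are all assumptions made for illustration.

```r
library(caret)         # used here only for createDataPartition()
library(randomForest)

set.seed(42)

# Fixed stratified split: 70% training pool, 30% test set
in_train <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_df <- iris[in_train, ]
test_df  <- iris[-in_train, ]

# Training set sizes to try, up to the full training pool
sizes <- seq(30, nrow(train_df), by = 15)

curve <- do.call(rbind, lapply(sizes, function(n) {
  sub <- train_df[sample(nrow(train_df), n), ]   # random subset of size n
  rf  <- randomForest(Species ~ ., data = sub, mtry = 2, ntree = 200)
  data.frame(
    size        = n,
    # Resubstitution error: predictions on the same rows the forest was fit on
    train_error = mean(predict(rf, sub) != sub$Species),
    test_error  = mean(predict(rf, test_df) != test_df$Species)
  )
}))

plot(curve$size, curve$test_error, type = "b", col = "red",
     ylim = range(0, curve$train_error, curve$test_error),
     xlab = "Training set size", ylab = "Misclassification rate")
lines(curve$size, curve$train_error, type = "b", col = "blue")
legend("topright", legend = c("test", "training"),
       col = c("red", "blue"), lty = 1)
```

Random forests usually fit their own training rows almost perfectly, so the training curve tends to sit near zero; calling predict(rf) without new data returns out-of-bag predictions instead, which may be a more informative comparison. Recent caret versions also appear to ship a learning_curve_dat() helper for this purpose, which is worth checking in the package documentation.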
