Order between using validation, training and test sets


Question

I am trying to understand the process of model evaluation and validation in machine learning. Specifically, in which order and how the training, validation and test sets must be used.

Let's say I have a dataset and I want to use linear regression. I am hesitating between various polynomial degrees (hyper-parameters).

This Wikipedia article seems to imply that the sequence should be:

  1. Split the data into a training set, a validation set and a test set.
  2. Use the training set to fit the model (find the best parameters: the coefficients of the polynomial).
  3. Afterwards, use the validation set to find the best hyper-parameters (in this case, the polynomial degree). The Wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset".
  4. Finally, use the test set to score the model fitted with the training set.

However, this seems strange to me: how can you fit your model with the training set if you have not yet chosen your hyper-parameters (the polynomial degree in this case)?

I see three alternative approaches, and I am not sure whether they are correct.

First approach

  1. Split the data into a training set, a validation set and a test set.
  2. For each polynomial degree, fit the model with the training set and give it a score using the validation set.
  3. For the polynomial degree with the best score, fit the model with the training set.
  4. Evaluate with the test set.

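Second approach

  1. Split the data into a training set, a validation set and a test set.
  2. For each polynomial degree, use cross-validation only on the validation set to fit and score the model.
  3. For the polynomial degree with the best score, fit the model with the training set.
  4. Evaluate with the test set.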

Third approach

  1. Split the data into only two sets: the training/validation set and the test set.
  2. For each polynomial degree, use cross-validation only on the training/validation set to fit and score the model.
  3. For the polynomial degree with the best score, fit the model with the training/validation set.
  4. Evaluate with the test set.

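So the questions are:

  • Is the Wikipedia article wrong, or am I missing something?
  • Are the three approaches I envisage correct? Which one would be preferable? Is there another approach better than these three?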

Recommended answer

What Wikipedia means is actually your first approach.

1. Split the data into a training set, a validation set and a test set.

2. Use the training set to fit the model (find the best parameters: the coefficients of the polynomial).

That just means that you use your training data to fit a model.

3. Afterwards, use the validation set to find the best hyper-parameters (in this case, the polynomial degree). The Wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset".

That means you take the model previously trained on the training set and use it to predict the values in the validation set, which gives you a score for how well the model performs on unseen data.

You repeat steps 2 and 3 for every hyperparameter combination you want to look at (in your case, the different polynomial degrees you want to try) to get a score for each combination (e.g. accuracy for a classifier, or an error measure such as mean squared error for a regression).

4. Finally, use the test set to score the model fitted with the training set.
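The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, the 60/20/20 split and the range of candidate degrees are arbitrary choices, and mean squared error stands in for whatever score you prefer. None of these details come from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a noisy cubic.
x = rng.uniform(-3, 3, size=300)
y = 0.5 * x**3 - x + rng.normal(scale=1.0, size=x.size)

# Step 1: split into training (60%), validation (20%) and test (20%) sets.
idx = rng.permutation(x.size)
train, val, test = idx[:180], idx[180:240], idx[240:]


def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


# Steps 2 and 3, repeated for every candidate degree:
# fit on the training set, score on the validation set.
degrees = range(1, 9)
val_errors = {
    d: mse(np.polyfit(x[train], y[train], d), x[val], y[val]) for d in degrees
}

# Keep the degree with the lowest validation error and fit it on the training set.
best_degree = min(val_errors, key=val_errors.get)
final_model = np.polyfit(x[train], y[train], best_degree)

# Step 4: score the selected model once on the untouched test set.
test_error = mse(final_model, x[test], y[test])
print(best_degree, test_error)
```

The test set is only touched once, after the degree has been fixed, so the reported error is not influenced by the model selection.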

Why you need the validation set is explained well in this Stack Exchange question: https://datascience.stackexchange.com/questions/18339/why-use-both-validation-set-and-test-set

In the end, you can use any of your three approaches.

First approach: this is the fastest, because you train only one model per hyperparameter combination. You also do not need as much data as for the other two.

Second approach: this is the slowest, because for every hyperparameter combination you train k models (one per fold), plus a final one on all your training data.

You also need a lot of data, because you split your data into three parts and then split the first part again into k folds.

But here you have the least variance in your results. It is pretty unlikely that you get k good models and a good validation result by coincidence; that is more likely to happen in the first approach. Cross-validation is also much less likely to overfit.

Third approach: its pros and cons lie in between those of the other two. Here, too, overfitting is less likely.
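As a rough sketch of the third approach (cross-validation on the combined training/validation set, then a single test-set evaluation), the following uses a hand-written k-fold loop so that it depends only on NumPy. The synthetic data, the 80/20 split, k = 5 and the helper name `cv_mse` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (an assumption for illustration): a noisy quadratic.
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=x.size)

# Step 1: hold out 20% as the test set; the rest is the training/validation set.
idx = rng.permutation(x.size)
dev, test = idx[:160], idx[160:]


def cv_mse(xs, ys, degree, k=5):
    """Average held-out MSE of a polynomial of the given degree over k folds."""
    folds = np.array_split(np.arange(xs.size), k)
    errors = []
    for fold in folds:
        mask = np.ones(xs.size, dtype=bool)
        mask[fold] = False  # points outside the fold are used for fitting
        coeffs = np.polyfit(xs[mask], ys[mask], degree)
        errors.append(np.mean((np.polyval(coeffs, xs[fold]) - ys[fold]) ** 2))
    return float(np.mean(errors))


# Step 2: cross-validate every candidate degree on the training/validation set only.
degrees = range(1, 7)
cv_errors = {d: cv_mse(x[dev], y[dev], d) for d in degrees}

# Step 3: refit the best degree on the whole training/validation set.
best_degree = min(cv_errors, key=cv_errors.get)
final_model = np.polyfit(x[dev], y[dev], best_degree)

# Step 4: evaluate once on the test set.
test_mse = float(np.mean((np.polyval(final_model, x[test]) - y[test]) ** 2))
print(best_degree, test_mse)
```

For each degree this fits k models during cross-validation and one more at the end, which is where the extra cost compared with a single validation split comes from.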

In the end it will depend on how much data you have and, if you get into more complex models like neural networks, on how much time and computing power you have and are willing to spend.

Edit: As @desertnaut mentioned, keep in mind that for the final evaluation on the test set you should use the training and validation sets together as training data. Also, you confused the training set with the validation set in your second approach.
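The point in the edit can be illustrated as follows: once the degree has been chosen, the training and validation indices are merged and the chosen model is refit on both before the test set is scored. The data is synthetic and the value of `best_degree` is a hypothetical placeholder for whatever model selection produced.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (an assumption for illustration): a noisy sine curve.
x = rng.uniform(-3, 3, size=300)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

idx = rng.permutation(x.size)
train, val, test = idx[:180], idx[180:240], idx[240:]

# Hypothetical: assume model selection on the validation set already picked this degree.
best_degree = 5

# Merge the training and validation sets and refit the chosen degree on both.
train_val = np.concatenate([train, val])
final_model = np.polyfit(x[train_val], y[train_val], best_degree)

# Score the refitted model on the test set, which was never used for fitting or selection.
test_mse = float(np.mean((np.polyval(final_model, x[test]) - y[test]) ** 2))
print(test_mse)
```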
