Order between using validation, training and test sets


Question

I am trying to understand the process of model evaluation and validation in machine learning. Specifically, in which order and how must the training, validation and test sets be used?

Let's say I have a dataset and I want to use linear regression, but I am undecided among several polynomial degrees (the hyper-parameter).

This Wikipedia article seems to imply that the sequence should be:

  1. Split the data into a training set, a validation set and a test set.
  2. Use the training set to fit the model (find the best parameters: the coefficients of the polynomial).
  3. Afterwards, use the validation set to find the best hyper-parameters (in this case, the polynomial degree). The Wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset".
  4. Finally, use the test set to score the model fitted with the training set.
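
Step 1 of that sequence could be sketched roughly as below. This is only an illustration, assuming NumPy; the helper name `three_way_split` and the 60/20/20 proportions are made up for the example and do not come from the Wikipedia article.

```python
import numpy as np

def three_way_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Return shuffled index arrays for the training, validation and test sets.

    Hypothetical helper: the name and the default fractions are assumptions.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)          # shuffle before splitting
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]          # whatever is left is used for training
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

The three index arrays are disjoint, so no observation is used in more than one role.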

However, this seems strange to me: how can you fit your model with the training set if you haven't yet chosen your hyper-parameters (the polynomial degree in this case)?

I see three alternative approaches, but I am not sure whether they are correct.

First approach

  1. Split data into training set, validation set and test set
  2. For each polynomial degree, fit the model with the training set and give it a score using the validation set.
  3. For the polynomial degree with the best score, fit the model with the training set.
  4. Evaluate with the test set
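
A minimal sketch of this first approach, assuming NumPy, synthetic cubic data, mean squared error as the score and candidate degrees 1 to 8 (all of these are illustrative choices, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
# Synthetic data (assumption): a cubic signal plus Gaussian noise.
y = 1.0 - 2.0 * x + 3.0 * x**3 + rng.normal(0.0, 0.1, x.size)

# Step 1: split into training (60%), validation (20%) and test (20%) sets.
idx = rng.permutation(x.size)
train, val, test = idx[:120], idx[120:160], idx[160:]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# Step 2: for each degree, fit on the training set and score on the validation set.
degrees = range(1, 9)
val_scores = {}
for d in degrees:
    coeffs = np.polyfit(x[train], y[train], d)
    val_scores[d] = mse(coeffs, x[val], y[val])

# Step 3: keep the degree with the lowest validation error and fit it on the training set.
best_degree = min(val_scores, key=val_scores.get)
best_coeffs = np.polyfit(x[train], y[train], best_degree)

# Step 4: the test set is touched only once, at the very end.
test_mse = mse(best_coeffs, x[test], y[test])
print(best_degree, test_mse)
```

Only one model is fitted per candidate degree here, and the test set plays no part in choosing the degree.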

Second approach

  1. Split data into training set, validation set and test set
  2. For each polynomial degree, use cross-validation only on the validation set to fit and score the model
  3. For the polynomial degree with the best score, fit the model with the training set.
  4. Evaluate with the test set

Third approach

  1. Split data into only two sets: the training/validation set and the test set
  2. For each polynomial degree, use cross-validation only on the training/validation set to fit and score the model
  3. For the polynomial degree with the best score, fit the model with the training/validation set.
  4. Evaluate with the test set
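
The third approach might look roughly like the sketch below. It assumes NumPy, synthetic cubic data, 5 folds and mean squared error as the score; the helper `cv_score` is a hypothetical name written for this example, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
# Synthetic data (assumption): a cubic signal plus Gaussian noise.
y = 1.0 - 2.0 * x + 3.0 * x**3 + rng.normal(0.0, 0.1, x.size)

# Step 1: hold out 20% as the test set; the rest is the training/validation set.
idx = rng.permutation(x.size)
trainval, test = idx[:160], idx[160:]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

def cv_score(degree, xs, ys, k=5):
    """Average held-out MSE of a polynomial of the given degree over k folds."""
    folds = np.array_split(np.arange(xs.size), k)
    errors = []
    for fold in folds:
        mask = np.ones(xs.size, dtype=bool)
        mask[fold] = False  # points outside the fold are used for fitting
        coeffs = np.polyfit(xs[mask], ys[mask], degree)
        errors.append(mse(coeffs, xs[fold], ys[fold]))
    return float(np.mean(errors))

# Step 2: cross-validate every candidate degree on the training/validation set only.
cv_scores = {d: cv_score(d, x[trainval], y[trainval]) for d in range(1, 9)}

# Step 3: refit the best degree on the whole training/validation set.
best_degree = min(cv_scores, key=cv_scores.get)
final_coeffs = np.polyfit(x[trainval], y[trainval], best_degree)

# Step 4: evaluate once on the test set.
test_mse = mse(final_coeffs, x[test], y[test])
print(best_degree, test_mse)
```

Because `trainval` comes from a random permutation, taking contiguous folds is already equivalent to shuffled folds.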

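So my questions are:

  • Is the Wikipedia article wrong, or am I missing something?
  • Are the three approaches I envisage correct? Which one is preferable? Is there a better approach than these three?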

Recommended answer

What Wikipedia means is actually your first approach.

  1. Split data into training set, validation set and test set

  2. Use the training set to fit the model (find the best parameters: coefficients of the polynomial).

That just means that you use your training data to fit a model.

  3. Afterwards, use the validation set to find the best hyper-parameters (in this case, polynomial degree) (wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset")

That means that you take the model previously trained on the training set and use it to predict the values of the validation dataset, which gives you a score for how well your model performs on unseen data.

You repeat steps 2 and 3 for every hyperparameter combination you want to look at (in your case, the different polynomial degrees you want to try) to get a score (e.g. accuracy, or mean squared error for a regression) for each hyperparameter combination.

  4. Finally, use the test set to score the model fitted with the training set.

Why you need the validation set is explained pretty well in this Stack Exchange question: https://datascience.stackexchange.com/questions/18339/why-use-both-validation-set-and-test-set

In the end you can use any of your three approaches.

First approach:

This is the fastest because you train only one model for every hyperparameter value. You also don't need as much data as for the other two.

Second approach:

This is the slowest because, for every hyperparameter combination, you train k classifiers (one per fold) plus a final one on all your training data to validate it.

You also need a lot of data because you split your data into three parts and then split the first part again into k folds.

But here you have the least variance in your results. It's pretty unlikely to get k good classifiers and a good validation result by coincidence; that is more likely to happen in the first approach. Cross-validation is also much less likely to overfit.

Third approach:

Its pros and cons lie in between those of the other two. Here, too, overfitting is less likely.

In the end it will depend on how much data you have and, if you get into more complex models like neural networks, how much time and computing power you have and are willing to spend.

Edit: As @desertnaut mentioned, keep in mind that you should use the training and validation sets together as training data for your evaluation with the test set. Also, you confused the training set with the validation set in your second approach.
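
The point about refitting could be sketched as follows. This is only an illustration, assuming NumPy and a polynomial degree that has already been selected on the validation set; `final_test_score` is a hypothetical helper name, and the toy data are noise-free on purpose.

```python
import numpy as np

def final_test_score(x, y, train, val, test, degree):
    """Refit on training + validation data, then return the MSE on the test set.

    Hypothetical helper: `degree` is assumed to have been chosen beforehand.
    """
    fit_idx = np.concatenate([train, val])   # both sets now serve as training data
    coeffs = np.polyfit(x[fit_idx], y[fit_idx], degree)
    residuals = np.polyval(coeffs, x[test]) - y[test]
    return float(np.mean(residuals ** 2))

# Toy usage on noise-free quadratic data (assumption for the example).
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x**2 - x + 0.5
idx = np.random.default_rng(0).permutation(x.size)
train, val, test = idx[:30], idx[30:40], idx[40:]
print(final_test_score(x, y, train, val, test, degree=2))
```

The validation points are no longer needed for model selection at this stage, so adding them to the fit gives the final model more data without touching the test set.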
