Is there a rule-of-thumb for how to divide a dataset into training and validation sets?


Question

Is there a rule-of-thumb for how to best divide data into training and validation sets? Is an even 50/50 split advisable? Or are there clear advantages of having more training data relative to validation data (or vice versa)? Or is this choice pretty much application dependent?

I have been mostly using an 80% / 20% of training and validation data, respectively, but I chose this division without any principled reason. Can someone who is more experienced in machine learning advise me?

Accepted answer

There are two competing concerns: with less training data, your parameter estimates have greater variance. With less testing data, your performance statistic will have greater variance. Broadly speaking you should be concerned with dividing data such that neither variance is too high, which is more to do with the absolute number of instances in each category rather than the percentage.

If you have a total of 100 instances, you're probably stuck with cross validation as no single split is going to give you satisfactory variance in your estimates. If you have 100,000 instances, it doesn't really matter whether you choose an 80:20 split or a 90:10 split (indeed you may choose to use less training data if your method is particularly computationally intensive).
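The small-data case above calls for cross-validation: with only ~100 instances, every instance should serve as held-out test data exactly once. A minimal sketch of k-fold index generation in plain Python (the fold count and seed are illustrative assumptions, not anything prescribed by the answer):

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    With only ~100 instances, no single train/test split gives a
    low-variance estimate, so every instance is tested exactly once.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    fold = n // k
    for i in range(k):
        # The last fold absorbs any remainder when n is not divisible by k.
        test = idx[i * fold:] if i == k - 1 else idx[i * fold:(i + 1) * fold]
        in_test = set(test)
        train = [j for j in idx if j not in in_test]
        yield train, test

folds = list(kfold_indices(100, 5))
print(len(folds))                          # 5
print(len(folds[0][0]), len(folds[0][1]))  # 80 20
```

Averaging the performance metric over the k folds gives a single estimate whose variance is much lower than any one 80:20 split of 100 instances could provide.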

Assuming you have enough data to do proper held-out test data (rather than cross-validation), the following is an instructive way to get a handle on variances:

  1. Split your data into training and testing (80/20 is indeed a good starting point)
  2. Split the training data into training and validation (again, 80/20 is a fair split).
  3. Subsample random selections of your training data, train the classifier with this, and record the performance on the validation set
  4. Try a series of runs with different amounts of training data: randomly sample 20% of it, say, 10 times and observe performance on the validation data, then do the same with 40%, 60%, 80%. You should see both greater performance with more data and lower variance across the different random samples
  5. To get a handle on variance due to the size of test data, perform the same procedure in reverse. Train on all of your training data, then randomly sample a percentage of your validation data a number of times, and observe performance. You should now find that the mean performance on small samples of your validation data is roughly the same as the performance on all the validation data, but the variance is much higher with smaller numbers of test samples
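The five steps above can be sketched end-to-end. The 1-D toy data and the threshold "classifier" below are stand-ins invented purely for illustration; substitute your own model and data:

```python
import random
import statistics

random.seed(0)

# Hypothetical 1-D toy data: class 0 centred at 0.0, class 1 at 2.0.
data = ([(random.gauss(0, 1), 0) for _ in range(500)]
        + [(random.gauss(2, 1), 1) for _ in range(500)])
random.shuffle(data)

# Step 1: split into training and testing (80/20).
cut = int(0.8 * len(data))
train_full, test = data[:cut], data[cut:]
# Step 2: split the training data into training and validation (again 80/20).
cut2 = int(0.8 * len(train_full))
train, valid = train_full[:cut2], train_full[cut2:]

def fit_threshold(samples):
    """A deliberately trivial 'classifier': midpoint of the two class means."""
    m0 = statistics.mean(x for x, y in samples if y == 0)
    m1 = statistics.mean(x for x, y in samples if y == 1)
    return (m0 + m1) / 2

def accuracy(threshold, samples):
    return sum((x > threshold) == (y == 1) for x, y in samples) / len(samples)

# Steps 3-4: train on random subsamples of growing size; the mean
# validation accuracy should rise and its spread should shrink.
for frac in (0.2, 0.4, 0.6, 0.8):
    scores = [accuracy(fit_threshold(random.sample(train, int(frac * len(train)))),
                       valid)
              for _ in range(10)]
    print(f"{frac:.0%} of train: mean={statistics.mean(scores):.3f} "
          f"sd={statistics.pstdev(scores):.4f}")

# Step 5: the reverse — fix the model, shrink the validation sample. The
# mean accuracy stays roughly constant but its spread grows as the
# evaluation set gets smaller.
thr = fit_threshold(train)
for frac in (0.2, 1.0):
    scores = [accuracy(thr, random.sample(valid, int(frac * len(valid))))
              for _ in range(10)]
    print(f"{frac:.0%} of valid: mean={statistics.mean(scores):.3f} "
          f"sd={statistics.pstdev(scores):.4f}")
```

The printed standard deviations make the two competing variances from the opening paragraph concrete: one shrinks as training data grows, the other as test data grows.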
