Is there a rule-of-thumb for how to divide a dataset into training and validation sets?


Question

Is there a rule-of-thumb for how to best divide data into training and validation sets? Is an even 50/50 split advisable? Or are there clear advantages of having more training data relative to validation data (or vice versa)? Or is this choice pretty much application dependent?

I have mostly been using an 80%/20% split of training and validation data, respectively, but I chose this division without any principled reason. Can someone more experienced in machine learning advise me?

Answer

There are two competing concerns: with less training data, your parameter estimates have greater variance; with less testing data, your performance statistic will have greater variance. Broadly speaking, you should divide the data so that neither variance is too high, which has more to do with the absolute number of instances in each category than with the percentage.

If you have a total of 100 instances, you're probably stuck with cross validation as no single split is going to give you satisfactory variance in your estimates. If you have 100,000 instances, it doesn't really matter whether you choose an 80:20 split or a 90:10 split (indeed you may choose to use less training data if your method is particularly computationally intensive).
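For the small-data case, the cross-validation mentioned above can be sketched in pure Python. The `k_fold_indices` helper below is illustrative (not from any particular library); it splits `n` instances into `k` folds so that every instance is used for evaluation exactly once:

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Round-robin assignment gives k folds of (nearly) equal size
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# With only 100 instances, 5-fold CV trains on 80 and evaluates on 20
# in each round, so every point contributes to the performance estimate.
splits = list(k_fold_indices(100, k=5))
```

Averaging the performance statistic over the `k` folds reduces the variance that any single 80:20 split of 100 instances would suffer from.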

Assuming you have enough data to set aside proper held-out test data (rather than cross-validating), the following is an instructive way to get a handle on the variances:

  1. Split your data into training and testing sets (80/20 is indeed a good starting point).
  2. Split the training data into training and validation sets (again, 80/20 is a fair split).
  3. Subsample random selections of your training data, train the classifier on them, and record the performance on the validation set.
  4. Try a series of runs with different amounts of training data: randomly sample 20% of it, say, 10 times, observing performance on the validation data each time, then do the same with 40%, 60%, and 80%. You should see both higher performance with more data and lower variance across the different random samples.
  5. To get a handle on variance due to the size of the test data, perform the same procedure in reverse: train on all of your training data, then randomly sample a percentage of your validation data a number of times and observe performance. You should now find that the mean performance on small samples of your validation data is roughly the same as the performance on all of it, but the variance is much higher with smaller numbers of test samples.
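The five steps above can be sketched end to end in pure Python. This is a minimal illustration, not the answerer's code: a toy synthetic dataset and a deliberately simple nearest-class-mean classifier stand in for your real data and model.

```python
import random
import statistics

random.seed(0)

# Toy binary data: class 0 centred at -1, class 1 at +1, with overlap
# (a hypothetical stand-in for a real dataset).
def make_point():
    y = random.randint(0, 1)
    return (2 * y - 1) + random.gauss(0, 1.5), y

data = [make_point() for _ in range(1000)]

# Steps 1-2: 80/20 train/test, then 80/20 train/validation.
random.shuffle(data)
test = data[:200]
train_full = data[200:]
val, train = train_full[:160], train_full[160:]

# A deliberately simple classifier: predict the class whose mean is nearest.
def fit(points):
    return {c: statistics.mean(x for x, y in points if y == c) for c in (0, 1)}

def accuracy(means, points):
    hits = sum(1 for x, y in points
               if min(means, key=lambda c: abs(x - means[c])) == y)
    return hits / len(points)

# Steps 3-4: train on random subsamples of growing size and watch the
# spread of validation scores shrink as the training fraction grows.
for frac in (0.2, 0.4, 0.6, 0.8):
    n = int(frac * len(train))
    scores = [accuracy(fit(random.sample(train, n)), val) for _ in range(10)]
    print(f"{int(frac * 100)}% train: mean={statistics.mean(scores):.3f} "
          f"sd={statistics.stdev(scores):.3f}")

# Step 5: fix the model, subsample the validation data instead. The mean
# score stays roughly the same, but the spread across subsamples is larger.
means = fit(train)
scores = [accuracy(means, random.sample(val, 32)) for _ in range(10)]
print(f"20% val subsets: mean={statistics.mean(scores):.3f} "
      f"sd={statistics.stdev(scores):.3f}")
```

The same experiment works with any real classifier and metric; only `make_point`, `fit`, and `accuracy` need replacing.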

