Is there any technique to know in advance the amount of training examples you need to make deep learning get good performance?


Problem description



Deep learning has been a revolution in recent years, and its success is tied to the huge amounts of data we can now manage and to the widespread availability of GPUs.

So here is the problem I'm facing. I know that deep neural nets deliver the best performance; there is no doubt about that. However, they perform well when the number of training examples is huge. If the number of training examples is low, it is usually better to use an SVM or decision trees.

But what is huge? What is low? In this paper on face recognition (FaceNet by Google), they show performance versus FLOPS (which can be related to the number of training examples).

They used between 100M and 200M training examples, which is huge.

My question is: is there any method to predict in advance the number of training examples I need to get good performance in deep learning? The reason I ask is that it is a waste of time to manually label a dataset if the performance is not going to be good.

Solution

My question is: is there any method to predict in advance the number of training examples I need to get good performance in deep learning? The reason I ask is that it is a waste of time to manually label a dataset if the performance is not going to be good.

The short answer is no. You do not have this kind of knowledge, and furthermore you never will. These kinds of problems are impossible to solve.

What you can have are just general heuristics and empirical knowledge, which can tell you whether it is probable that DL will not work well (it is possible to predict the failure of a method, while it is nearly impossible to predict its success), nothing more. In current research, DL rarely works well for datasets smaller than hundreds of thousands or millions of samples (I do not count MNIST, because everything works well on MNIST). Furthermore, DL is actually heavily studied in just two types of problems, NLP and image processing, so you cannot really extrapolate it to any other kind of problem (no free lunch theorem).

Update

Just to make it a bit clearer: what you are asking is to predict whether a given estimator (or set of estimators) will yield good results on a particular training set. In fact, you restrict the question even further, to the training-set size alone.

The simplest proof (based on your simplification) is as follows: for any N (sample size), I can construct an N-mode (or N^2-mode, to make it even more obvious) distribution which no estimator can reasonably estimate (including a deep neural network), and I can construct trivial data with just one label (so a perfect model requires just one sample). End of proof (there are two different answers for the same N).
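This construction is easy to sketch in code. Below is an illustrative (not from the original answer) 1D example: the same sample size N, once with a single trivial label and once with labels that flip on a grid far finer than the sample can cover, so a standard learner memorizes the training set yet does no better than chance on held-out points:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
N = 512
X = rng.normal(size=(N, 1))

# Case 1: trivial data -- every point gets the same label.
y_trivial = np.zeros(N, dtype=int)

# Case 2: same N, but the label flips on a grid of width 1/N,
# so any sample of size N badly under-covers the pattern.
y_modes = (np.floor(X[:, 0] * N) % 2).astype(int)

accs = {}
for name, y in [("trivial", y_trivial), ("N-mode", y_modes)]:
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
    accs[name] = clf.score(Xte, yte)
print(accs)
```

The trivial labeling is learned perfectly from any sample, while the N-mode labeling stays near 50% test accuracy: two very different answers for the exact same N.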

Now let us assume that we do have access to the training samples (without labels for now), not just the sample size. We are given X (training samples) of size N. Again, I can construct an N-mode labeling that yields a distribution impossible to estimate (by anything), and a trivial labeling (just a single label!). Again, two different answers for the exact same input.

Ok, so maybe given training samples and labels we can predict what will behave well? Now we cannot manipulate the samples or labels to show that no such function exists, so we have to go back to statistics and to what we are actually trying to answer. We are asking about the expected value of the loss function over the whole probability distribution which generated our training samples. So once again, the whole "clue" is to see that I can manipulate the underlying distributions (construct many different ones, many of which are impossible to model well with a deep neural network) and still expect my training samples to come from them. This is what statisticians call the problem of having a non-representative sample from a pdf.

In particular, in ML we often relate this problem to the curse of dimensionality. In simple words: in order to estimate a probability well, we need an enormous number of samples. Silverman showed that even if you know your data is just a normal distribution and you ask "what is the value at 0?", you need exponentially many samples (compared to the space dimensionality). In practice our distributions are multi-modal, complex and unknown, so this amount is even higher. It is quite safe to say that, with any number of samples we could ever gather, we cannot reasonably estimate distributions with more than ~10 dimensions. Consequently, whatever we do to minimize the expected error, we are just using heuristics which connect the empirical error (fit to the data) with some kind of regularization (removing overfitting, usually by putting some prior assumptions on the families of distributions).

To sum up: we cannot construct a method able to decide whether our model will behave well, because this would require deciding which "complexity" of distribution generated our samples. There will be some simple cases where we can do it, and they will probably say something like "oh! this data is so simple even knn will work well!". You cannot have a generic tool for a DNN or any other (complex) model, though (to be strict, we can have such a predictor for very simple models, because they are so limited that we can easily check whether your data follows this extreme simplicity or not).
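A quick Monte Carlo sketch of this curse (illustrative only, not Silverman's actual derivation): the fraction of standard-normal samples that land near the origin, i.e. that carry any information about the density "at 0", collapses as the dimension grows, so the sample size needed for a fixed accuracy explodes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, radius = 100_000, 1.0

# For each dimension d, measure how many of n standard-normal samples
# fall within `radius` of the origin. These are the only samples that
# say anything about the density near 0.
fractions = {}
for d in (1, 2, 5, 10, 20):
    x = rng.normal(size=(n, d))
    fractions[d] = float(np.mean(np.linalg.norm(x, axis=1) < radius))
print(fractions)
```

In 1D roughly two thirds of the samples are informative; by 10 dimensions almost none are, and by 20 dimensions you would need astronomically many samples to see even one.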

Consequently, this boils down to nearly the same problem: actually building a model... so you will need to try and validate your approach (that is, train a DNN to answer whether a DNN works well). You can use cross-validation, bootstrapping or anything else here, but they all essentially do the same thing: build multiple models of your desired type and validate them.
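The validate-everything loop described above can be sketched with scikit-learn. The dataset here is a hypothetical synthetic stand-in; swap in your own X, y (and your own candidate models):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in data; replace with your real dataset.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Build multiple models of the desired types and validate each one --
# there is no shortcut that avoids actually fitting them.
candidates = {
    "linear": LogisticRegression(max_iter=1000),
    "svm-rbf": SVC(),
    "small-net": MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                               random_state=0),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
print(scores)
```

Only the cross-validated scores, not any a priori rule, tell you which family is worth pursuing on this particular data.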

To sum up

I do not claim that we will never have good heuristics; heuristics drive many parts of ML quite well. I only answer whether there is a method able to answer your question, and there is no such thing, nor can there be. There can be many rules of thumb which will work well for some problems (classes of problems). And we already have some:

  • for NLP/2d images you should have at least ~100,000 samples to work with a DNN
  • having lots of unlabeled instances can partially substitute for the above number (so you can have, say, 30,000 labeled + 70,000 unlabeled) with pretty reasonable results
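One empirical way to ground such rules of thumb in your own data is a learning curve: fit the model at growing training-set sizes and check whether validation accuracy is still climbing (more labels likely help) or has flattened (they likely will not). A sketch with scikit-learn on a hypothetical synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Hypothetical stand-in data; replace with your real dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Cross-validated score at 5 increasing training-set sizes.
sizes, _, val_scores = learning_curve(
    SVC(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5
)
for n, s in zip(sizes, val_scores.mean(axis=1)):
    print(n, round(float(s), 3))
```

A still-rising curve suggests labeling more data is worth the effort; a flat one suggests you have hit this model's ceiling, which is as close to "predicting the required sample size" as you can empirically get.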

Furthermore, this does not mean that given this amount of data a DNN will be better than a kernelized SVM or even a linear model. This is exactly what I was referring to earlier: you can easily construct counterexamples of distributions where an SVM will work just as well or even better, despite the number of samples. The same applies to any other technique.

Yet still, even if you are only interested in whether a DNN will work well (rather than better than the alternatives), these are just empirical, trivial heuristics, based on at most 10 (!) types of problems. It could be very harmful to treat them as rules or methods. They are just rough, first intuitions gained through the extremely unstructured, random research that happened over the last decade.

Ok, so I am lost now... when should I use DL? And the answer is extremely simple:

Use deep learning only if:

  • You already tested "shallow" techniques and they do not work well
  • You have large amounts of data
  • You have huge computational resources
  • You have experience with neural networks (these are very tricky and unforgiving models, really)
  • You have a great amount of time to spare, even if the effect will only be a few % better results.
