How to handle non-determinism when training on a GPU?


Problem description

While tuning the hyperparameters to get my model to perform better, I noticed that the score I get (and hence the model that is created) is different every time I run the code despite fixing all the seeds for random operations. This problem does not happen if I run on CPU.
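
For reference, seed fixing of the kind mentioned here typically looks like the sketch below, assuming a TensorFlow 1.x style setup (in TensorFlow 2.x, tf.random.set_seed replaces tf.set_random_seed). Even with all of these set, GPU runs can still diverge, which is exactly the problem described in this question.

```python
import os
import random

import numpy as np
import tensorflow as tf

# Fix every source of pseudo-randomness controlled from Python.
os.environ["PYTHONHASHSEED"] = "0"  # only fully effective if set before the interpreter starts
random.seed(42)                     # Python's built-in RNG
np.random.seed(42)                  # NumPy RNG (weight init helpers, shuffling, ...)
tf.set_random_seed(42)              # TensorFlow graph-level seed (TF 1.x API)
```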

I googled and found out that this is a common issue when using a GPU to train. Here is a very good/detailed example with short code snippets to verify the existence of that problem.

They pinpointed the non-determinism to the tf.reduce_sum function. However, that is not the case for me. It could be because I'm using different hardware (1080 TI) or a different version of the CUDA libraries or TensorFlow. It seems like there are many different parts of the CUDA libraries that are non-deterministic, and it doesn't seem easy to figure out exactly which part and how to get rid of it. Also, this must have been by design, so it's likely that there is a sufficient efficiency increase in exchange for the non-determinism.
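
A quick way to check whether a given op is deterministic on your own hardware, in the spirit of the linked example, is to run the same reduction several times on the GPU and compare the results exactly. A sketch, assuming TensorFlow 1.x with a visible GPU:

```python
import numpy as np
import tensorflow as tf

# A large float32 tensor makes order-dependent rounding visible.
data = np.random.RandomState(0).rand(100, 100000).astype(np.float32)

x = tf.constant(data)
total = tf.reduce_sum(x)  # the op suspected of being non-deterministic on GPU

with tf.Session() as sess:
    results = [sess.run(total) for _ in range(10)]

# On the CPU all results are bit-identical; on some GPU/driver/library
# combinations the set below contains more than one value.
print(set(float(r) for r in results))
```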

So, my question is:

Since GPUs are popular for training NNs, people in this field must have a way to deal with non-determinism, because I can't see how else you'd be able to reliably tune the hyperparameters. What is the standard way to handle non-determinism when using a GPU?

Recommended answer

TL;DR

  • Non-determinism for a priori deterministic operations comes from concurrent (multi-threaded) implementations.
  • Despite constant progress on that front, TensorFlow does not currently guarantee determinism for all of its operations. After a quick search on the internet, it seems that the situation is similar with the other major toolkits.
  • During training, unless you are debugging an issue, it is OK to have fluctuations between runs. Uncertainty is in the nature of training, and it is wise to measure it and take it into account when comparing results – even when toolkits eventually reach perfect determinism in training.

That, but much longer

When you see neural network operations as mathematical operations, you would expect everything to be deterministic. Convolutions, activations, cross-entropy – everything here is a mathematical equation and should be deterministic. Even pseudo-random operations such as shuffling, drop-out, noise and the like are entirely determined by a seed.

When you see those operations from their computational implementation, on the other hand, you see them as massively parallelized computations, which can be a source of randomness unless you are very careful.

The heart of the problem is that, when you run operations on several parallel threads, you typically do not know which thread will end first. It is not important when threads operate on their own data, so for example, applying an activation function to a tensor should be deterministic. But when those threads need to synchronize, such as when you compute a sum, then the result may depend on the order of the summation, and in turn, on which thread ended first.
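
The order dependence itself is easy to see even without a GPU, because floating-point addition is not associative. A tiny illustration in NumPy: the same three numbers summed in two different orders give two different float32 results.

```python
import numpy as np

values = np.array([1.0, 1e8, -1e8], dtype=np.float32)

print(np.sum(values))        # (1.0 + 1e8) + (-1e8) -> 0.0, the 1.0 is rounded away
print(np.sum(values[::-1]))  # (-1e8 + 1e8) + 1.0   -> 1.0
```

On a GPU, the accumulation order of a large reduction depends on which threads finish first, so the same effect can make identical runs produce slightly different sums.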

From there, you have, broadly speaking, two options:

  • Keep the non-determinism that comes with the simpler implementations.

  • Take extra care in the design of your parallel algorithm to reduce or remove non-determinism in your computation. The added constraint usually results in slower algorithms.

Which route does CuDNN take? Well, mostly the deterministic one. In recent releases, deterministic operations are the norm rather than the exception. But it used to offer many non-deterministic operations, and more importantly, it used not to offer some operations, such as reductions, that people then needed to implement themselves in CUDA with varying degrees of attention to determinism.

Some libraries such as Theano were further ahead on this topic, exposing early on a deterministic flag that the user could turn on or off – but as you can see from its description, it is far from offering any guarantee:

If more, sometimes we will select some implementations that are more deterministic, but slower. In particular, on the GPU, we will avoid using AtomicAdd. Sometimes we will still use non-deterministic implementations, e.g. when we do not have a GPU implementation that is deterministic. Also see the dnn.conv.algo* flags to cover more cases.
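
For example, with Theano that flag can be turned on through THEANO_FLAGS before the library is imported. A sketch; the flag names below are taken from Theano's configuration options (deterministic and the dnn.conv.algo* family mentioned in the description above):

```python
import os

# Must be set before `import theano`, otherwise the flags are ignored.
os.environ["THEANO_FLAGS"] = ",".join([
    "deterministic=more",                      # prefer more deterministic (but slower) implementations
    "dnn.conv.algo_bwd_filter=deterministic",  # deterministic cuDNN backward-filter convolution
    "dnn.conv.algo_bwd_data=deterministic",    # deterministic cuDNN backward-data convolution
])

import theano  # noqa: E402  (imported after setting the flags on purpose)
```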

In TensorFlow, the realization of the need for determinism came rather late, but it is slowly getting there – helped also by the advances of CuDNN on that front. For a long time, reductions were non-deterministic, but now they seem to be deterministic. The fact that CuDNN introduced deterministic reductions in version 6.0 may of course have helped.

It seems that currently, the main obstacle for TensorFlow on the road to determinism is the backward pass of the convolution. It is indeed one of the few operations for which CuDNN proposes a non-deterministic algorithm, labeled CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0. This algorithm is still in the list of possible choices for the backward filter in TensorFlow, and since the choice of algorithm seems to be based on performance, it could indeed be picked when it is more efficient. (I am not so familiar with TensorFlow's C++ code, so take this with a grain of salt.)
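
More recent TensorFlow releases (1.14 and later, if I am not mistaken) also read a TF_CUDNN_DETERMINISTIC environment variable that restricts CuDNN to deterministic convolution algorithms, trading some speed for reproducibility. A sketch of how one might opt in:

```python
import os

# Assumption: TensorFlow >= 1.14. Ask CuDNN to use only deterministic
# convolution (and max-pooling) algorithms; this is slower, but removes the
# backward-filter non-determinism discussed above.
os.environ["TF_CUDNN_DETERMINISTIC"] = "1"

import tensorflow as tf  # noqa: E402  (imported after setting the variable)
```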

Does it matter?

If you are debugging an issue, determinism is not merely important: it is mandatory. You need to reproduce the steps that led to a problem. This is currently a real issue with toolkits like TensorFlow. To mitigate this problem, your only option is to debug live, adding checks and breakpoints at the correct locations – not great.

Deployment is another side of things, where it is often desirable to have deterministic behavior, in part for human acceptance. While nobody would reasonably expect a medical diagnosis algorithm to never fail, it would be awkward if a computer could give the same patient a different diagnosis depending on the run. (Although doctors themselves are not immune to this kind of variability.)

Those reasons are legitimate motivations to fix non-determinism in neural networks.

For all other aspects, I would say that we need to accept, if not embrace, the non-deterministic nature of neural net training. For all intents and purposes, training is stochastic. We use stochastic gradient descent, shuffle data, and use random initialization and dropout – and more importantly, the training data itself is but a random sample of data. From that standpoint, the fact that computers can only generate pseudo-random numbers from a seed is an artifact. When you train, your loss is therefore a value that comes with a confidence interval due to this stochastic nature. Comparing those values to optimize hyper-parameters while ignoring those confidence intervals does not make much sense – therefore it is vain, in my opinion, to spend too much effort fixing non-determinism in that, and many other, cases.
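
In practice, that means treating each training run as one draw from a distribution and comparing hyper-parameter settings through the spread over several seeds rather than through single numbers. A minimal sketch of what measuring this could look like (train_and_evaluate is a hypothetical placeholder for your own training and evaluation routine):

```python
import numpy as np

def train_and_evaluate(config, seed):
    """Hypothetical placeholder: train a model with the given
    hyper-parameters and seed, and return a validation score."""
    raise NotImplementedError

def score_config(config, n_runs=5):
    # Repeat training with different seeds to estimate run-to-run spread.
    scores = np.array([train_and_evaluate(config, seed) for seed in range(n_runs)])
    return scores.mean(), scores.std(ddof=1)

# Prefer config_a over config_b only if the gap between the means is large
# compared to the run-to-run standard deviations.
# mean_a, std_a = score_config(config_a)
# mean_b, std_b = score_config(config_b)
```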
