Determinism in tensorflow gradient updates?


Problem description

So I have a very simple NN script written in Tensorflow, and I am having a hard time trying to trace down where some "randomness" is coming in from.

I have recorded the

  • Weights,
  • Gradients,
  • Logits

of my network as I train, and for the first iteration, it is clear that everything starts off the same. I have a SEED value both for how data is read in, and a SEED value for initializing the weights of the net. Those I never change.
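The question doesn't show the seeding code itself, but a minimal sketch (TF 1.x API, with hypothetical shapes and a hypothetical input pipeline) of what such seeding typically looks like:

```python
import tensorflow as tf

SEED = 42

# Graph-level seed; op-level random ops derive their seeds from this.
tf.set_random_seed(SEED)

# Seeded weight initialization (hypothetical layer shape).
weights = tf.get_variable(
    "weights", shape=[784, 10],
    initializer=tf.truncated_normal_initializer(stddev=0.1, seed=SEED))

# Seeded shuffling when reading data (hypothetical input pipeline).
dataset = tf.data.Dataset.range(1000).shuffle(buffer_size=1000, seed=SEED)
```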

My problem is that on, say, the second iteration of every re-run I do, I start to see the gradients diverge (by a small amount, say 1e-6 or so). However, over time this of course leads to non-repeatable behaviour.

What might the cause of this be? I don't know where any possible source of randomness might be coming from...

Thanks

Solution

There's a good chance you could get deterministic results if you run your network on CPU (export CUDA_VISIBLE_DEVICES=), with a single thread in the Eigen thread pool (tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=1))), one Python thread (no multi-threaded queue runners that you get from ops like tf.batch), and a single well-defined operation order. Setting inter_op_parallelism_threads=1 may also help in some scenarios.
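For concreteness, a minimal sketch of that configuration using the TF 1.x API referenced in the answer (the graph-building and training code is omitted):

```python
import os
import tensorflow as tf

# Hide all GPUs so everything runs on the CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# One thread inside each op (Eigen thread pool) and one thread scheduling ops.
config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... run training steps here, feeding data from a single Python thread
```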

One issue is that floating point addition/multiplication is non-associative, so one fool-proof way to get deterministic results is to use integer arithmetic or quantized values.
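A quick illustration in plain Python of why summation order matters for floats but not for integers:

```python
# Floating point addition is not associative: the same numbers grouped
# differently round differently.
print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
print(0.1 + (0.2 + 0.3))   # 0.6

# Integer addition is exact, so any grouping/order gives the same result.
print((1 + 2) + 3 == 1 + (2 + 3))   # True
```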

Barring that, you could isolate which operation is non-deterministic and try to avoid using that op. For instance, there's the tf.add_n op, which doesn't say anything about the order in which it sums the values, but different orders can produce different results.
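The same order-dependence can be reproduced outside TensorFlow by accumulating identical float32 values in two different orders (an illustrative sketch, not the tf.add_n kernel itself):

```python
import numpy as np

vals = np.random.RandomState(0).randn(100000).astype(np.float32)

forward = np.float32(0.0)
for v in vals:            # accumulate front to back
    forward += v

backward = np.float32(0.0)
for v in vals[::-1]:      # accumulate back to front
    backward += v

print(forward, backward)  # typically differ in the last few digits
```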

Getting deterministic results is a bit of an uphill battle because determinism is in conflict with performance, and performance is usually the goal that gets more attention. An alternative to trying to have exactly the same numbers on reruns is to focus on numerical stability -- if your algorithm is stable, then you will get reproducible results (i.e., the same number of misclassifications) even though the exact parameter values may be slightly different.
