Getting reproducible results using tensorflow-gpu


Problem Description

Working on a project using Tensorflow. However, I can't seem to reproduce my results.

I have tried setting the graph-level seed, the numpy random seed and even operation-level seeds. However, it is still not reproducible.
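The seed-setting steps described above can be sketched as follows (a minimal sketch assuming TensorFlow 2.x; on the 1.x API the graph-level seed is set with `tf.set_random_seed` instead):

```python
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42

# Only effective if set before the interpreter starts; shown for completeness.
os.environ["PYTHONHASHSEED"] = str(SEED)

random.seed(SEED)        # Python's built-in RNG
np.random.seed(SEED)     # NumPy RNG
tf.random.set_seed(SEED) # TensorFlow graph-level seed

# An operation-level seed can be set on top of the graph-level seed:
noise = tf.random.normal([3], seed=SEED)
```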

On searching Google, most people point to the reduce_sum function as the culprit, since reduce_sum is non-deterministic on the GPU even after setting the seeds. However, since I am working on a project for a paper, I need to reproduce the results. Is there any other efficient function that can work around this?
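As a hedged sketch of a later workaround (neither switch existed when this question was asked): TensorFlow 2.1+ honours the `TF_DETERMINISTIC_OPS=1` environment variable, and TensorFlow 2.8+ replaces it with `tf.config.experimental.enable_op_determinism()`, which makes GPU reductions such as `reduce_sum` deterministic at some speed cost:

```python
import os

os.environ["TF_DETERMINISTIC_OPS"] = "1"  # opt-in for TF 2.1 - 2.7

import tensorflow as tf

# TF >= 2.8: the supported switch (guarded so older versions don't crash)
if hasattr(tf.config.experimental, "enable_op_determinism"):
    tf.config.experimental.enable_op_determinism()

tf.random.set_seed(0)
x = tf.random.uniform([1024, 1024], seed=0)
s1 = float(tf.reduce_sum(x))
s2 = float(tf.reduce_sum(x))
print(s1 == s2)  # True with determinism enabled
```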

Another suggestion was to use the CPU. However, I'm working on big data, so the CPU is not an option. How do people working on complex projects using Tensorflow work around this issue? Or is it acceptable to reviewers to load a saved model checkpoint file for result verification?
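Shipping a checkpoint for verification could look like this minimal sketch (the variable and the path are hypothetical stand-ins; `tf.train.Checkpoint` is the TF 2.x object-based checkpoint API):

```python
import tensorflow as tf

weights = tf.Variable([1.0, 2.0, 3.0])     # stand-in for real model weights
ckpt = tf.train.Checkpoint(weights=weights)
path = ckpt.save("/tmp/paper_model/ckpt")  # hypothetical location

weights.assign([0.0, 0.0, 0.0])            # simulate a fresh process
ckpt.restore(path)                         # weights are back to [1. 2. 3.]
```

A reviewer who receives the checkpoint files can then rebuild the same objects and call `restore` to verify the reported numbers without retraining.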

Recommended Answer

Cool, that you want to make your results reproducible! However, there are many things to note here:

I call a paper reproducible if one can obtain exactly the same numbers as found in the paper by executing exactly the same steps. This means that with access to the same environment, the same software, hardware and data, one would be able to get the same results. In contrast, a paper is called replicable if one can achieve the same results by only following the textual description in the paper. Hence replicability is harder to achieve, but it is also a more powerful indicator of the quality of the paper.

You want to achieve bit-wise identical training results. The holy grail would be to write your paper in a way that people who ONLY have the paper can still confirm your results.

Please also note that in many important papers results are practically impossible to reproduce:

  • The datasets are often not available: JFT-300M
  • Huge amounts of compute: for one of Google's AutoML/Architecture Search papers, I asked the authors how many GPU hours one of the experiments took. At the time, that many GPU hours would have cost me roughly $250,000.

Whether that is a problem depends very much on the context. As a comparison, think of CERN / LHC: it is impossible to run completely identical experiments, and only very few institutions on earth have the instruments to check the results. Still, it is not a problem. So ask your advisor / people who have already published in that journal / conference.

This is super hard. I think the following is helpful:

  • Make sure the quality metrics you report do not have too many digits
  • Since training may depend on the random initialization, you might also want to give an interval rather than a single number
  • Try minor variations
  • Re-implement from scratch (maybe with another library?)
  • Ask colleagues to read your paper and then explain back to you what they think you did
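The "interval rather than a single number" suggestion can be sketched in plain NumPy; `run_experiment` is a hypothetical stand-in for a full train-and-evaluate cycle:

```python
import numpy as np

def run_experiment(seed: int) -> float:
    """Hypothetical stand-in: pretends to train/evaluate with a given seed."""
    rng = np.random.default_rng(seed)
    return 0.90 + 0.02 * rng.standard_normal()  # fake accuracy score

scores = np.array([run_experiment(s) for s in range(5)])
# Report few digits and an interval, not a lone over-precise number:
print(f"accuracy: {scores.mean():.3f} +/- {scores.std(ddof=1):.3f}")
```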

It seems to me that you already do the important things:

  • Setting all seeds: numpy, tensorflow, random, ...
  • Making sure the Training-Test split is consistent
  • Making sure the training data is loaded in the same order
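The split-and-ordering points above can be sketched like this (assumptions: index-based data, an arbitrary 80/20 ratio, and NumPy's `default_rng` so the split depends only on the seed):

```python
import numpy as np

def make_split(n_samples: int, seed: int, test_frac: float = 0.2):
    """Seed-fixed split; sorting the indices also fixes the load order."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(n_samples * test_frac)
    return np.sort(idx[n_test:]), np.sort(idx[:n_test])

train_a, test_a = make_split(100, seed=0)
train_b, test_b = make_split(100, seed=0)
assert (train_a == train_b).all() and (test_a == test_b).all()  # reproducible
assert len(set(train_a) & set(test_a)) == 0                     # disjoint
```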

Please note that there might be factors out of your control:

  • Bitflips: B. Schroeder, E. Pinheiro, and W.-D. Weber, "DRAM errors in the wild: a large-scale field study"
  • Inherent hardware/software reproducibility problems: floating point multiplication is not associative, and different cores on a GPU might finish computations at different times. Thus each single run could lead to different results. (I'd be happy if somebody could give an authoritative reference here)
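The non-associativity mentioned in the last point needs no GPU to observe; summing the same three doubles in a different order already changes the result:

```python
left = (0.1 + 0.2) + 0.3   # 0.1 + 0.2 rounds to 0.30000000000000004
right = 0.1 + (0.2 + 0.3)  # 0.2 + 0.3 rounds to exactly 0.5
print(left == right)       # False: 0.6000000000000001 vs 0.6
```

A GPU reduction that accumulates partial sums in a thread-dependent order hits exactly this effect, which is why `reduce_sum` can vary between runs.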
