Synchronous vs asynchronous computation in Tensorflow

Question

The Tensorflow CIFAR tutorial discusses using multiple GPUs and gives this warning:

"Naively employing asynchronous updates of model parameters leads to sub-optimal training performance because an individual model replica might be trained on a stale copy of the model parameters. Conversely, employing fully synchronous updates will be as slow as the slowest model replica."

What does this mean? Could someone provide a very simple example that illustrates this warning?

Recommended answer

Suppose you have n workers.

Asynchronous means that each worker just reads the parameters, computes an update, and writes the updated parameters, without any locking mechanism at all. The workers can freely overwrite each other's work.

Suppose worker 1 is slow for some reason. Worker 1 reads the parameters at time t, and then tries to write updated parameters at time t+100. In the meantime, workers 2 through n have all made many updates at time steps t+1, t+2, and so on. When the slow worker 1 finally does its write, it overwrites all of the progress the other workers have made, because its update was computed from the stale parameters it read at time t.
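The stale-write scenario can be sketched as a tiny simulation. This is only an illustration under stated assumptions, not Tensorflow code: the toy loss f(w) = 0.5 * w**2, the learning rate, and the names `grad` and `simulate_async` are all hypothetical, and the fast workers are collapsed into a single loop that makes one update per time step.

```python
def grad(w):
    # Gradient of the assumed toy loss f(w) = 0.5 * w**2 (minimum at w = 0).
    return w


def simulate_async(w0=10.0, lr=0.1, slow_delay=100):
    """Return (w just before the slow write, w just after the slow write)."""
    w = w0

    # Time t: the slow worker reads the shared parameters and starts computing.
    stale_read = w

    # Times t+1 .. t+slow_delay-1: the fast workers keep reading the current
    # parameters, taking a gradient step, and writing the result back.
    for _ in range(1, slow_delay):
        w = w - lr * grad(w)
    w_before_slow_write = w

    # Time t+slow_delay: the slow worker writes parameters that it computed
    # from its stale read, with no locking, replacing whatever is there now.
    w = stale_read - lr * grad(stale_read)
    return w_before_slow_write, w


before, after = simulate_async()
print("before slow write:", before)
print("after slow write: ", after)
```

In this sketch the fast workers drive w very close to the optimum at 0, and the single late write from the slow worker puts it back at the value one step away from the starting point.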

Fully synchronous means that all the workers are coordinated. Every worker reads the parameters, computes a gradient, and then waits for the other workers to finish. Then the learning algorithm computes the average of all of the gradients they computed, and does a single update based on that average. If worker 1 is very slow and takes 100 time steps to finish, but workers 2 through n all finish by time step 2, then most of the workers will spend most of their time idle, waiting for worker 1.
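The cost of waiting can be sketched the same way. Again this is a hypothetical illustration rather than Tensorflow code: `sync_step`, the gradient values, and the per-worker times are made-up inputs chosen to match the example above.

```python
def sync_step(w, worker_grads, worker_times, lr=0.1):
    """One fully synchronous update.

    worker_grads: gradient computed by each worker from the same parameters.
    worker_times: time steps each worker needed to compute its gradient.
    """
    # The update uses the average of all workers' gradients.
    avg_grad = sum(worker_grads) / len(worker_grads)

    # Nobody can proceed until the slowest worker has finished.
    step_time = max(worker_times)

    # Time each worker spends waiting at the barrier.
    idle_times = [step_time - t for t in worker_times]

    new_w = w - lr * avg_grad
    return new_w, step_time, idle_times


# Worker 1 takes 100 time steps; workers 2-4 take 2 time steps each.
new_w, step_time, idle_times = sync_step(
    w=10.0,
    worker_grads=[10.0, 10.0, 10.0, 10.0],
    worker_times=[100, 2, 2, 2],
)
print("step took", step_time, "time steps; idle per worker:", idle_times)
```

Here the update itself is consistent, since every gradient was computed from the same parameters, but the whole step takes as long as the slowest worker and the fast workers sit idle for 98 of the 100 time steps.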
