When is TensorFlow's ParameterServerStrategy preferable to its MultiWorkerMirroredStrategy?


Question

When training a neural network across multiple servers and GPUs, I can't think of a scenario where the ParameterServerStrategy would be preferable to the MultiWorkerMirroredStrategy.

What are the ParameterServerStrategy's main use cases and why would it be better than using MultiWorkerMirroredStrategy?

Recommended answer


    • MultiWorkerMirroredStrategy is intended for synchronous distributed training across multiple workers, each of which can have multiple GPUs.

      ParameterServerStrategy: Supports parameter servers. It can be used for multi-GPU synchronous local training or asynchronous multi-machine training.
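
      As a rough illustration of how the two setups differ at the cluster level, the sketch below builds the JSON that each task would export as the `TF_CONFIG` environment variable. The host names and the helper `make_tf_config` are hypothetical and only for illustration; the point is that a parameter-server cluster declares `ps` tasks (and typically a `chief`) in addition to workers, while an all-reduce cluster consists of workers only.

```python
import json


def make_tf_config(cluster, task_type, task_index):
    """Return the JSON string one task would export as TF_CONFIG."""
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    )


# Hypothetical host names, for illustration only.
mwms_cluster = {"worker": ["host1:2222", "host2:2222"]}
pss_cluster = {
    "chief": ["host0:2222"],
    "worker": ["host1:2222", "host2:2222"],
    "ps": ["host3:2222"],
}

# Every task gets the same "cluster" section but its own "task" section.
mwms_worker0 = make_tf_config(mwms_cluster, "worker", 0)
pss_ps0 = make_tf_config(pss_cluster, "ps", 0)

print(mwms_worker0)
print(pss_ps0)
```

      Each process in the job would set its own `TF_CONFIG` before creating the strategy; only the `task` entry changes from process to process.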

      One of the key differences is that ParameterServerStrategy can be used for asynchronous training, while MultiWorkerMirroredStrategy is intended for synchronous distributed training. In MultiWorkerMirroredStrategy, a copy of all variables in the model is kept on each device across all workers, and a communication method is needed to keep all variables in sync. In contrast, in ParameterServerStrategy each variable of the model is placed on one parameter server.
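
      The placement difference can be pictured with a small pure-Python sketch. This is not TensorFlow code: the function names, device names, and variable names are made up for illustration, and the round-robin assignment to parameter servers is an assumption chosen to keep the example simple.

```python
def mirrored_placement(variables, devices):
    """Mirrored: every device holds its own copy of every variable."""
    return {device: list(variables) for device in devices}


def parameter_server_placement(variables, parameter_servers):
    """Parameter server: each variable lives on exactly one server.

    Round-robin assignment is an assumption made for this sketch.
    """
    placement = {ps: [] for ps in parameter_servers}
    for i, var in enumerate(variables):
        placement[parameter_servers[i % len(parameter_servers)]].append(var)
    return placement


# Hypothetical variable and device names.
variables = ["dense/kernel", "dense/bias", "out/kernel", "out/bias"]
mirrored = mirrored_placement(
    variables, ["worker0/gpu0", "worker0/gpu1", "worker1/gpu0"]
)
on_ps = parameter_server_placement(variables, ["ps0", "ps1"])

print(mirrored)
print(on_ps)
```

      In the mirrored case every copy must be kept identical after each step, which is where the communication cost comes from; in the parameter-server case workers read from and write to the single authoritative copy of each variable.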

      This matters because:


      • In synchronous training, all workers are kept in sync in terms of training epochs and steps, so the other workers must wait for a failed or preempted worker to restart before they can continue. If the failed or preempted worker does not restart for some reason, the remaining workers will keep waiting.

      In contrast, in ParameterServerStrategy each worker runs the same code independently, while the parameter servers run a standard server. This means that while each worker synchronously computes a single gradient update across all of its GPUs, updates between workers proceed asynchronously. Operations that occur only on the first replica (such as incrementing the global step) occur on the first replica of every worker. Hence, unlike with MultiWorkerMirroredStrategy, different workers do not wait on each other.
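
      A toy timing model makes the waiting behaviour concrete. This is a simplified sketch, not a measurement: it assumes fixed per-step times, ignores communication cost, and uses made-up numbers. In the synchronous case every step lasts as long as the slowest worker; in the asynchronous case each worker proceeds at its own pace.

```python
def sync_batches(step_times, budget):
    """Synchronous: each global step takes as long as the slowest worker."""
    global_steps = int(budget // max(step_times))
    # Every worker processes one batch per global step.
    return global_steps * len(step_times)


def async_batches(step_times, budget):
    """Asynchronous: each worker processes batches at its own pace."""
    return sum(int(budget // t) for t in step_times)


# Seconds per step for four workers; the last one is a straggler.
step_times = [1.0, 1.0, 1.0, 4.0]
budget = 60.0  # seconds of wall-clock time

print(sync_batches(step_times, budget))   # 15 global steps * 4 workers = 60
print(async_batches(step_times, budget))  # 60 + 60 + 60 + 15 = 195
```

      More batches processed does not automatically mean faster convergence: asynchronous updates can be computed from stale parameters, so the trade-off still has to be checked on the actual model.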

      I guess the question is: do you expect workers to fail, and will the delay in restarting them slow down training when using MultiWorkerMirroredStrategy? If that is the case, maybe ParameterServerStrategy is better.

      Edit: answers to questions in the comments:


      So is the only benefit of PSS the fact that it resists better to failing workers than MWMS?

      Not exactly - even if workers do not fail in MWMS, they still need to stay in sync, so there could be network bottlenecks.


      If so, then I imagine it would only be useful when training on many workers, say 20 or more, or else the probability that a worker will fail during training is low (and it can be avoided by saving regular snapshots).

      Maybe not; it depends on the situation. Perhaps in your scenario the probability of failure is low. In someone else's scenario it may be higher. For the same number of workers, the longer a job runs, the more likely it is that a failure occurs in the middle of it. To illustrate further (with an oversimplified example): if I have the same number of nodes, but they're simply slower, they could take much longer to do a job, and hence there is a greater likelihood of some kind of interruption or failure occurring during the job.
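
      The effect of job length and worker count can be put into rough numbers. The sketch below assumes that failures are independent and that each worker has the same fixed probability of failing in any given hour; both are simplifying assumptions, and the rate used is made up.

```python
def prob_any_failure(num_workers, job_hours, hourly_failure_prob):
    """Probability that at least one worker fails during the job.

    Assumes independent failures with a constant per-worker, per-hour rate.
    """
    prob_all_survive = (1.0 - hourly_failure_prob) ** (num_workers * job_hours)
    return 1.0 - prob_all_survive


rate = 0.001  # hypothetical per-worker, per-hour failure probability

short_job = prob_any_failure(num_workers=4, job_hours=10, hourly_failure_prob=rate)
long_job = prob_any_failure(num_workers=4, job_hours=100, hourly_failure_prob=rate)
many_workers = prob_any_failure(num_workers=20, job_hours=10, hourly_failure_prob=rate)

print(f"4 workers, 10 h:   {short_job:.3f}")
print(f"4 workers, 100 h:  {long_job:.3f}")
print(f"20 workers, 10 h:  {many_workers:.3f}")
```

      Under this model, slowing the same four workers down so that the job takes ten times as long raises the risk just as adding more workers does, which is the point made above.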


      (and it can be avoided by saving regular snapshots).

      Not sure I understand what you mean - if a worker fails and you've saved a snapshot, then you haven't lost data. But the worker still needs to restart. In the interim between the failure and the restart, the other workers may be waiting.


      Isn't there a possible benefit with I/O saturation? If the updates are asynchronous, I/O would be more spread out in time, right? But maybe this benefit is cancelled by the fact that it uses more I/O? Could you please detail this a bit?

      I will first try to answer it from a conceptual point of view.


      • I would say try looking at it from a different angle - in a synchronous operation, you're waiting for something else to finish, and you may be idle until that something gives you what you need. In contrast, in an asynchronous operation you do your own work, and when you need more you ask for it.

      There is no hard and fast rule about whether synchronous operations or asynchronous operations are better. It depends on the situation.

      I will now try to answer it from an optimization point of view:


      Isn't there a possible benefit with I/O saturation? If the updates are asynchronous, I/O would be more spread out in time, right? But maybe this benefit is cancelled by the fact that it uses more I/O? Could you please detail this a bit?

      In a distributed system, your bottleneck could be CPU / GPU, disk, or network. Nowadays networks are really fast, and in some cases faster than disk. Depending on your workers' configuration, CPU / GPU could be the bottleneck. So it really depends on the configuration of your hardware and network.

      Hence I would do some performance testing to determine where the bottlenecks in your system are, and optimize for your specific problem.

      Edit: additional follow-up question:


      One last thing: in your experience, in what use cases is PSS used? I mean, both PSS and MWMS are obviously for use with large datasets (or else a single machine would suffice), but what about the model? Would PSS be better for larger models? And in your experience, is MWMS more frequently used?

      I think cost and the type of problem being worked on may influence the choice. For example, both AWS and GCP offer "spot instances" / "preemptible instances", which are heavily discounted servers that can be taken away at any moment. In such a scenario, it may make sense to use PSS - even though machine failure is unlikely, an instance may simply be taken away without notice because it is a "spot instance". If you use PSS, the performance impact of servers disappearing may not be as large as when using MWMS. If you're using dedicated instances, the instances are dedicated to you and will not be taken away - the only risk of interruption is machine failure. In such cases MWMS may be more attractive if you can take advantage of performance optimisations or plugin architecture.
