Distributed tensorflow: the difference between In-graph replication and Between-graph replication


Problem Description


I got confused about the two concepts In-graph replication and Between-graph replication when reading the Replicated training section of TensorFlow's official How-to.

  1. It's said in the above link that

    In-graph replication. In this approach, the client builds a single tf.Graph that contains one set of parameters (in tf.Variable nodes pinned to /job:ps); ...

    Does this mean there are multiple tf.Graphs in the Between-graph replication approach? If yes, where is the corresponding code in the provided examples?

  2. While there is already a Between-graph replication example in the above link, could anyone provide an In-graph replication implementation (pseudo code is fine) and highlight its main differences from Between-graph replication?

    Thanks in advance!


Edit_1: more questions

Thanks a lot for your detailed explanations and gist code @mrry @YaroslavBulatov! After looking at your responses, I have the following two questions:

  1. There is the following statement in Replicated training:

    Between-graph replication. In this approach, there is a separate client for each /job:worker task, typically in the same process as the worker task. Each client builds a similar graph containing the parameters (pinned to /job:ps as before using tf.train.replica_device_setter() to map them deterministically to the same tasks); and a single copy of the compute-intensive part of the model, pinned to the local task in /job:worker.

    I have two sub-questions related to above words in bold.

    (A) Why do we say each client builds a similar graph, but not the same graph? I would expect the graph built by each client in the Replicated training example to be the same, because the graph construction code below is shared by all workers:

    # Build model...
    loss = ...
    global_step = tf.Variable(0)

    (B) Shouldn't it be multiple copies of the compute-intensive part of the model, since we have multiple workers?

  2. Does the example in Replicated training support training on multiple machines, each of which has multiple GPUs? If not, can we simultaneously use In-graph replication to support training on multiple GPUs on each machine and Between-graph replication for cross-machine training? I ask this because @mrry indicated that In-graph replication is essentially the same as the approach used in the CIFAR-10 example model for multiple GPUs.

Solution

First of all, for some historical context, "in-graph replication" is the first approach that we tried in TensorFlow, and it did not achieve the performance that many users required, so the more complicated "between-graph" approach is the current recommended way to perform distributed training. Higher-level libraries such as tf.learn use the "between-graph" approach for distributed training.

To answer your specific questions:

  1. Does this mean there are multiple tf.Graphs in the between-graph replication approach? If yes, where is the corresponding code in the provided examples?

    Yes. The typical between-graph replication setup will use a separate TensorFlow process for each worker replica, and each of these will build a separate tf.Graph for the model. Usually each process uses the global default graph (accessible through tf.get_default_graph()), which is not created explicitly.

    (In principle, you could use a single TensorFlow process with the same tf.Graph and multiple tf.Session objects that share the same underlying graph, as long as you configured the tf.ConfigProto.device_filters option for each session differently, but this is an uncommon setup.)
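
    A minimal sketch of the typical per-worker process described above (the cluster addresses, the toy model, and the use of tf.train.Supervisor are illustrative assumptions here, not code from the how-to):

    import tensorflow as tf

    # Assumed cluster layout; each process is launched with its own
    # job_name and task_index (e.g. via command-line flags).
    cluster = tf.train.ClusterSpec({"ps": ["ps0:2222"],
                                    "worker": ["worker0:2222", "worker1:2222"]})
    job_name, task_index = "worker", 0
    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

    if job_name == "ps":
      server.join()
    else:
      # This client builds its own (default) graph: parameters go to /job:ps,
      # compute-intensive ops go to this worker's task.
      with tf.device(tf.train.replica_device_setter(
          worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 10])
        w = tf.Variable(tf.zeros([10, 1]))
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        global_step = tf.Variable(0, trainable=False)
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=global_step)

      sv = tf.train.Supervisor(is_chief=(task_index == 0), global_step=global_step)
      with sv.managed_session(server.target) as sess:
        sess.run(train_op, feed_dict={x: [[0.0] * 10]})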

  2. While there is already a between-graph replication example in above link, could anyone provide an in-graph replication implementation (pseudocode is fine) and highlight its main differences from between-graph replication?

    For historical reasons, there are not many examples of in-graph replication (Yaroslav's gist is one exception). A program using in-graph replication will typically include a loop that creates the same graph structure for each worker (e.g. the loop on line 74 of the gist), and use variable sharing between the workers.

    The one place where in-graph replication persists is for using multiple devices in a single process (e.g. multiple GPUs). The CIFAR-10 example model for multiple GPUs is an example of this pattern (see the loop over GPU devices here).
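
    As a rough sketch of the in-graph pattern across worker tasks (the cluster layout, the toy model, and the loss-averaging step are assumptions for illustration, and differ in detail from the gist):

    import tensorflow as tf

    num_workers = 2  # assumed number of worker tasks

    # ONE client builds ONE graph containing a single set of shared parameters...
    with tf.device("/job:ps/task:0"):
      w = tf.Variable(tf.zeros([10, 1]))

    # ...and one copy of the compute-intensive part per worker task.
    losses = []
    for i in range(num_workers):
      with tf.device("/job:worker/task:%d" % i):
        x = tf.placeholder(tf.float32, [None, 10], name="x_%d" % i)
        losses.append(tf.reduce_mean(tf.square(tf.matmul(x, w))))

    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        tf.add_n(losses) / num_workers)

    # A single client then drives the whole graph through one session, e.g.
    # sess = tf.Session("grpc://worker0:2222").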

(In my opinion, the inconsistency between how multiple workers and multiple devices in a single worker are treated is unfortunate. In-graph replication is simpler to understand than between-graph replication, because it doesn't rely on implicit sharing between the replicas. Higher-level libraries, such as tf.learn and TF-Slim, hide some of these issues, and offer hope that we can provide a better replication scheme in the future.)

  1. Why do we say each client builds a similar graph, but not the same graph?

    Because they aren't required to be identical (and there is no integrity check that enforces this). In particular, each worker might create a graph with different explicit device assignments ("/job:worker/task:0", "/job:worker/task:1", etc.). The chief worker might create additional operations that are not created on (or used by) the non-chief workers. However, in most cases, the graphs are logically (i.e. modulo device assignments) the same.
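
    For example (an illustrative sketch with assumed names, not code from the example), every client can run the same construction code yet produce a graph that differs in its device strings and in chief-only ops:

    import tensorflow as tf

    task_index = 0  # in practice parsed from a --task_index flag, so it differs per client

    with tf.device(tf.train.replica_device_setter(
        ps_tasks=1, worker_device="/job:worker/task:%d" % task_index)):
      w = tf.Variable(tf.zeros([10]))     # pinned to /job:ps in every client
      loss = tf.reduce_sum(tf.square(w))  # pinned to this client's own worker task

    if task_index == 0:                   # chief worker only
      saver = tf.train.Saver()            # non-chief graphs never contain this op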

    Shouldn't it be multiple copies of the compute-intensive part of the model, since we have multiple workers?

    Typically, each worker has a separate graph that contains a single copy of the compute-intensive part of the model. The graph for worker i does not contain the nodes for worker j (assuming i ≠ j). (An exception would be the case where you're using between-graph replication for distributed training, and in-graph replication for using multiple GPUs in each worker. In that case, the graph for a worker would typically contain N copies of the compute-intensive part of the graph, where N is the number of GPUs in that worker.)

  2. Does the example in Replicated training support training on multiple machines, each of which has multiple GPUs?

    The example code only covers training on multiple machines, and says nothing about how to train on multiple GPUs in each machine. However, the techniques compose easily. In this part of the example:

    # Build model...
    loss = ...
    

    ...you could add a loop over the GPUs in the local machine to achieve distributed training across multiple workers, each of which has multiple GPUs.
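
    A hedged sketch of that combination (the number of GPUs, the toy model, and the gradient-averaging details are assumptions, not part of the original example):

    import tensorflow as tf

    num_gpus = 2    # assumed number of GPUs on this machine
    task_index = 0  # this worker's task index

    opt = tf.train.GradientDescentOptimizer(0.01)
    tower_grads = []
    for gpu in range(num_gpus):
      # Variables still go to /job:ps; compute ops go to one local GPU of this worker.
      with tf.device(tf.train.replica_device_setter(
          ps_tasks=1,
          worker_device="/job:worker/task:%d/gpu:%d" % (task_index, gpu))):
        with tf.variable_scope("model", reuse=(gpu > 0)):
          x = tf.placeholder(tf.float32, [None, 10], name="x_gpu%d" % gpu)
          w = tf.get_variable("w", [10, 1])
          loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
          tower_grads.append(opt.compute_gradients(loss))

    # Average the per-GPU gradients, as in the CIFAR-10 multi-GPU example.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
      grads = [g for g, _ in grads_and_vars]
      averaged.append((tf.add_n(grads) / len(grads), grads_and_vars[0][1]))

    global_step = tf.Variable(0, trainable=False, name="global_step")
    train_op = opt.apply_gradients(averaged, global_step=global_step)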
