tensorflow: difference between multi GPUs and distributed tensorflow


Question

I am a little confused about these two concepts.

I saw some examples of multi-GPU training that do not use clusters and servers in the code.

Are these two different? What is the difference?

Thanks a lot!

Answer

It depends a little on the perspective from which you look at it. In any multi-* setup, either multi-GPU or multi-machine, you need to decide how to split up your computation across the parallel resources. In a single-node, multi-GPU setup, there are two very reasonable choices:

(1) Intra-model parallelism. If a model has long, independent computation paths, then you can split the model across multiple GPUs and have each compute a part of it. This requires careful understanding of the model and the computational dependencies.
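As a concrete illustration, here is a minimal sketch of intra-model parallelism, assuming TF 1.x graph-mode APIs; the branch names, layer sizes, and input shape are made up for the example:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 128])

# Two independent branches of the model are pinned to different GPUs.
with tf.device('/gpu:0'):
    branch_a = tf.layers.dense(x, 256, activation=tf.nn.relu, name='branch_a')

with tf.device('/gpu:1'):
    branch_b = tf.layers.dense(x, 256, activation=tf.nn.relu, name='branch_b')

# The branches only meet here, so the two GPUs can compute them in parallel.
with tf.device('/gpu:0'):
    logits = tf.layers.dense(tf.concat([branch_a, branch_b], axis=1), 10,
                             name='head')
```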

(2) Replicated training. Start up multiple copies of the model, train them, and then synchronize their learning (the gradients applied to their weights & biases).
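For example, here is a minimal sketch of replicated (data-parallel) training on one machine, again assuming TF 1.x graph mode; the two-tower setup, model, and shapes are illustrative:

```python
import tensorflow as tf

NUM_GPUS = 2  # illustrative; use however many towers you have
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
tower_grads = []

for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i):
        # Each tower gets its own shard of the input batch.
        x = tf.placeholder(tf.float32, [None, 128], name='x_%d' % i)
        y = tf.placeholder(tf.int64, [None], name='y_%d' % i)
        # AUTO_REUSE makes every tower share one set of weights & biases.
        logits = tf.layers.dense(x, 10, name='model', reuse=tf.AUTO_REUSE)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)
        tower_grads.append(optimizer.compute_gradients(loss))

# Synchronize the learning: average each variable's gradient over the
# towers, then apply the combined update once.
avg_grads = [
    (tf.reduce_mean(tf.stack([g for g, _ in pairs]), axis=0), pairs[0][1])
    for pairs in zip(*tower_grads)
]
train_op = optimizer.apply_gradients(avg_grads)
```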

Our released Inception model has some good diagrams in the readme that show how both multi-GPU and distributed training work.

But to tl;dr that source: In a multi-GPU setup, it's often best to synchronously update the model by storing the weights on the CPU (well, in its attached DRAM). But in a multi-machine setup, we often use a separate "parameter server" that stores and propagates the weight updates. To scale that to a lot of replicas, you can shard the parameters across multiple parameter servers.
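A minimal sketch of that single-machine pattern, assuming TF 1.x; the variables are pinned to /cpu:0 so they live in host DRAM, while each GPU tower reads them for its computation (shapes are illustrative):

```python
import tensorflow as tf

with tf.device('/cpu:0'):
    # The shared weights are stored once, in host DRAM.
    w = tf.get_variable('w', shape=[128, 10])

tower_outputs = []
for i in range(2):
    with tf.device('/gpu:%d' % i):
        x = tf.placeholder(tf.float32, [None, 128], name='x_%d' % i)
        # Each GPU reads a copy of w for its forward pass; gradient
        # updates flow back to the CPU-resident variable.
        tower_outputs.append(tf.matmul(x, w))
```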

With multiple GPUs and parameter servers, you'll find yourself being more careful about device placement, using constructs such as with tf.device('/gpu:1'), or placing weights on the parameter servers using tf.train.replica_device_setter to assign them to /job:ps or /job:worker.
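For instance, a minimal sketch of distributed placement, assuming TF 1.x and a hypothetical cluster of one parameter server and two workers (the hostnames are placeholders):

```python
import tensorflow as tf

# Hypothetical cluster: one parameter server and two workers.
cluster = tf.train.ClusterSpec({
    'ps': ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})

# replica_device_setter pins variables to /job:ps (round-robin across ps
# tasks when there are several, i.e. sharding) and ops to the worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    w = tf.get_variable('w', shape=[128, 10])  # placed on /job:ps
    b = tf.get_variable('b', shape=[10])       # placed on /job:ps
    x = tf.placeholder(tf.float32, [None, 128])
    logits = tf.matmul(x, w) + b               # placed on /job:worker
```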

In general, training on a bunch of GPUs in a single machine is much more efficient -- it takes more than 16 distributed GPUs to equal the performance of 8 GPUs in a single machine -- but distributed training lets you scale to even larger numbers, and harness more CPU.
