How does TensorFlow cluster distribute load across machines if not specified explicitly?


Question

I took the "Distributed TensorFlow" how-to and tried to apply it to the "MNIST For ML Beginners" tutorial. I started three TensorFlow worker nodes locally (the PC has 8 cores) and ran the training script, replacing this line:

sess = tf.InteractiveSession()

with this:

sess = tf.InteractiveSession("grpc://localhost:12345")

where 12346 is the port node 0 is listening on (i.e. the master session is created on node 0). Note that I did not specify explicitly where the computations should be performed.
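For reference, the three local worker nodes can be brought up with a cluster spec along the following lines. This is only a sketch of the setup, not the script from the how-to; the job name and the ports of tasks 1 and 2 are assumptions (only task 0's port, 12346, comes from the question).

import tensorflow as tf

# Sketch of the worker setup (job name and the extra ports are assumptions).
# Task 0 listens on 12346, matching the grpc target used by the session above.
cluster = tf.train.ClusterSpec({
    "worker": ["localhost:12346", "localhost:12347", "localhost:12348"]
})

# Run one process per task; this one hosts worker task 0.
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()  # serve graph-execution requests from sessions that connect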

Looking at htop's output, I can see that the job is indeed performed by the cluster, since it consumes some CPU. However, the only consumer is node 0; the remaining nodes do not perform any work. If I select node 1 as the place to create the master session, the picture changes: only about 2/3 of the work is performed on node 0 (judging by CPU load), while the remaining 1/3 is performed on node 1. If I select node 2 as the master, that 1/3 of the work is performed on node 2 instead. If I run two processes in parallel, one using node 1 as the master and another using node 2 as the master, both nodes 1 and 2 get some load, but node 0 is loaded much more (roughly 200% vs 60% vs 60% CPU).

So far it looks like the "default" behavior of distributed TensorFlow is not great for parallelizing work automatically right now. I'm wondering what the behavior actually is, and whether distributed TensorFlow is intended for data parallelization at all (as opposed to manual model parallelization)?

Answer

TF is great for data parallelization, e.g. when you need to sift through tons of data that is then distributed across multiple GPUs.

It's also great for weights parallelization. Using tf.train.replica_device_setter, weights are distributed among multiple devices for better IO.
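As an illustration (not part of the answer itself), here is a minimal sketch of how tf.train.replica_device_setter is typically used; the ps/worker addresses and model shapes are made up.

import tensorflow as tf

# Hypothetical cluster: one parameter-server task and two worker tasks.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Variables created inside this block are placed on the "ps" job (round-robin
# if there are several ps tasks); other ops stay on the worker that builds the graph.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    W = tf.Variable(tf.zeros([784, 10]), name="weights")
    b = tf.Variable(tf.zeros([10]), name="bias")
    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.nn.softmax(tf.matmul(x, W) + b)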

Now, it seems you are asking about parallelization within a single model. That is hard to do automatically, since TensorFlow does not know the best way to distribute the computation of one model across multiple devices; it depends on too many factors, e.g. how fast the connection between your devices is.
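If you do want parts of one model to land on specific machines, the placement has to be done by hand with tf.device. A hedged sketch under the question's setup (the task indices and layer sizes are illustrative):

import tensorflow as tf

# Manual model parallelism: pin parts of the graph to specific worker tasks.
# Without explicit blocks like these, ops default to the master's task,
# which matches the load pattern described in the question.
with tf.device("/job:worker/task:1"):
    x = tf.placeholder(tf.float32, [None, 784])
    W1 = tf.Variable(tf.truncated_normal([784, 128], stddev=0.1))
    h = tf.nn.relu(tf.matmul(x, W1))

with tf.device("/job:worker/task:2"):
    W2 = tf.Variable(tf.truncated_normal([128, 10], stddev=0.1))
    logits = tf.matmul(h, W2)

sess = tf.Session("grpc://localhost:12346")  # master session on node 0, as in the question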

