How does TensorFlow cluster distribute load across machines if not specified explicitly?


Question

I took the "Distributed TensorFlow" how-to and tried to apply it to the "MNIST For ML Beginners" tutorial. I started three TensorFlow worker nodes locally (the PC has 8 cores) and ran the training script, replacing this line:

sess = tf.InteractiveSession()

with this:

sess = tf.InteractiveSession("grpc://localhost:12345")

where 12346 is the port node 0 is listening on (i.e. the master session is created on node 0). Note that I did not specify explicitly where the computations should be performed.
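For reference, the three local worker nodes can be brought up with a cluster spec along the following lines. This is only a sketch of the setup, not the script from the how-to; the job name and the ports of tasks 1 and 2 are assumptions (only task 0's port, 12346, comes from the question).

import tensorflow as tf

# Sketch of the worker setup (job name and the extra ports are assumptions).
# Task 0 listens on 12346, matching the grpc target used by the session above.
cluster = tf.train.ClusterSpec({
    "worker": ["localhost:12346", "localhost:12347", "localhost:12348"]
})

# Run one process per task; this one hosts worker task 0.
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()  # serve graph-execution requests from sessions that connect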

Looking at htop's output, I can see that the job is indeed performed by the cluster, since it consumes some CPU. However, the only consumer is node 0; the remaining nodes do not perform any work. If I select node 1 as the place to create the master session, the picture changes: only about 2/3 of the work is performed on node 0 (judging by CPU load), while the remaining 1/3 is performed on node 1. If I select node 2 as the master, that 1/3 of the work is performed on node 2 instead. If I run two processes in parallel, one using node 1 as the master and another using node 2 as the master, both nodes 1 and 2 get some load, but node 0 is loaded much more (roughly 200% vs 60% vs 60% CPU).

So far it looks like the "default" behavior of distributed TensorFlow is not great for parallelizing work automatically right now. I'm wondering what the behavior actually is, and whether distributed TensorFlow is intended for data parallelization at all (as opposed to manual model parallelization)?

Answer

TF is great for data parallelization, e.g. when you need to sift through tons of data that is then distributed across multiple GPUs.

It's also great for weights parallelization. Using tf.train.replica_device_setter, weights are distributed among multiple devices for better IO.
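As an illustration (not part of the answer itself), here is a minimal sketch of how tf.train.replica_device_setter is typically used; the ps/worker addresses and model shapes are made up.

import tensorflow as tf

# Hypothetical cluster: one parameter-server task and two worker tasks.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Variables created inside this block are placed on the "ps" job (round-robin
# if there are several ps tasks); other ops stay on the worker that builds the graph.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    W = tf.Variable(tf.zeros([784, 10]), name="weights")
    b = tf.Variable(tf.zeros([10]), name="bias")
    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.nn.softmax(tf.matmul(x, W) + b)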

Now, it seems you are asking about parallelization within a single model. That is hard to do automatically, since TensorFlow does not know the best way to distribute the computation of one model across multiple devices; it depends on too many factors, e.g. how fast the connection between your devices is.
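If you do want parts of one model to land on specific machines, the placement has to be done by hand with tf.device. A hedged sketch under the question's setup (the task indices and layer sizes are illustrative):

import tensorflow as tf

# Manual model parallelism: pin parts of the graph to specific worker tasks.
# Without explicit blocks like these, ops default to the master's task,
# which matches the load pattern described in the question.
with tf.device("/job:worker/task:1"):
    x = tf.placeholder(tf.float32, [None, 784])
    W1 = tf.Variable(tf.truncated_normal([784, 128], stddev=0.1))
    h = tf.nn.relu(tf.matmul(x, W1))

with tf.device("/job:worker/task:2"):
    W2 = tf.Variable(tf.truncated_normal([128, 10], stddev=0.1))
    logits = tf.matmul(h, W2)

sess = tf.Session("grpc://localhost:12346")  # master session on node 0, as in the question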

