Distributed tensorflow replicated training example: grpc_tensorflow_server - No such file or directory


Question

I am trying to set up a distributed TensorFlow implementation by following the instructions in this blog post: Distributed TensorFlow by Leo K. Tam. My aim is to perform replicated training as described in that post.

I have completed the steps up to installing TensorFlow, and I can successfully run the following command and get results:

sudo bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

The next thing I want to do is launch the gRPC server on one of the nodes with the following command:

bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server --cluster_spec='worker|192.168.555.254:2500;192.168.555.255:2501' --job_name=worker --task_id=0 &

However, when I run it, I get the following error:

-bash: bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server: No such file or directory

The contents of my rpc folder are:

 libgrpc_channel.pic.a              libgrpc_remote_master.pic.lo       libgrpc_session.pic.lo             libgrpc_worker_service_impl.pic.a  _objs/                             
 libgrpc_master_service_impl.pic.a  libgrpc_remote_worker.pic.a        libgrpc_tensor_coding.pic.a        libgrpc_worker_service.pic.a       
 libgrpc_master_service.pic.lo      libgrpc_server_lib.pic.lo          libgrpc_worker_cache.pic.a         librpc_rendezvous_mgr.pic.a

I am clearly missing a step in between, which is not mentioned in the blog. My objective is to be able to run the command mentioned above (to launch the gRPC server) so that I can start a worker process on one of the nodes.

Solution

The grpc_tensorflow_server binary was a temporary measure used in a pre-release version of Distributed TensorFlow, and it is no longer built by default or included in the binary distributions. Its replacement is the tf.train.Server Python class, which is more programmable and easier to use.

You can write simple Python scripts using tf.train.Server to reproduce the behavior of grpc_tensorflow_server:

# ps.py. Run this on 192.168.0.1. (IP addresses changed to be valid.)
import tensorflow as tf
server = tf.train.Server({"ps": ["192.168.0.1:2222"]},
                         {"worker": ["192.168.0.2:2222", "192.168.0.3:2222"]},
                         job_name="ps", task_index=0)
server.join()

# worker_0.py. Run this on 192.168.0.2.
import tensorflow as tf
server = tf.train.Server({"ps": ["192.168.0.1:2222"]},
                         {"worker": ["192.168.0.2:2222", "192.168.0.3:2222"]},
                         job_name="worker", task_index=0)
server.join()

# worker_1.py. Run this on 192.168.0.3. (IP addresses changed to be valid.)
import tensorflow as tf
server = tf.train.Server({"ps": ["192.168.0.1:2222"]},
                         {"worker": ["192.168.0.2:2222", "192.168.0.3:2222"]},
                         job_name="worker", task_index=1)
server.join()

Clearly this example could be cleaned up and made reusable with command-line flags etc. (one possible shape for that is sketched below), but TensorFlow doesn't prescribe a particular form for these. The main things to note are that (i) there is one tf.train.Server instance per TensorFlow task, (ii) all Server instances must be constructed with the same "cluster definition" (the dictionary mapping job names to lists of addresses), and (iii) each task is identified by a unique pair of job_name and task_index.
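As a rough sketch of that cleanup (the script name, flag names, and the use of argparse and tf.train.ClusterSpec are my own choices here, not something the answer prescribes), a single reusable launcher might look like this:

# server.py. A parameterized version of the three scripts above (illustrative only).
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument("--ps_hosts", default="192.168.0.1:2222",
                    help="comma-separated ps host:port pairs")
parser.add_argument("--worker_hosts",
                    default="192.168.0.2:2222,192.168.0.3:2222",
                    help="comma-separated worker host:port pairs")
parser.add_argument("--job_name", required=True, choices=["ps", "worker"])
parser.add_argument("--task_index", type=int, default=0)
args = parser.parse_args()

# Every task rebuilds the identical cluster definition, per point (ii) above.
cluster = tf.train.ClusterSpec({"ps": args.ps_hosts.split(","),
                                "worker": args.worker_hosts.split(",")})
server = tf.train.Server(cluster, job_name=args.job_name,
                         task_index=args.task_index)
server.join()

You would then run, for example, python server.py --job_name=worker --task_index=1 on 192.168.0.3, and the corresponding invocations on the other two machines.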

Once you run the three scripts on their respective machines, you can create another script that connects to them:

import tensorflow as tf

sess = tf.Session("grpc://192.168.0.2:2222")
# ...
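To make the connection concrete, here is a minimal sketch of what such a client script might go on to do, assuming the TF 1.x graph API and the cluster above; the device strings and the toy computation are illustrative, not part of the original answer:

# client.py. Run from any machine that can reach 192.168.0.2:2222.
import tensorflow as tf

# Pin a variable to the ps task and a computation to a worker task.
with tf.device("/job:ps/task:0"):
    w = tf.Variable(2.0, name="w")

with tf.device("/job:worker/task:0"):
    y = w * 3.0

with tf.Session("grpc://192.168.0.2:2222") as sess:
    sess.run(w.initializer)
    print(sess.run(y))  # prints 6.0, computed on the cluster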
