Distributed TensorFlow replicated training example: grpc_tensorflow_server - No such file or directory
I am trying to build a distributed TensorFlow implementation by following the instructions in this blog post: Distributed TensorFlow by Leo K. Tam. My aim is to perform replicated training as described in that post.
I have completed the steps up to installing TensorFlow, and I successfully ran the following command and got results:
sudo bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
The next thing I want to do is launch the gRPC server on one of the nodes with the following command:
bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server --cluster_spec='worker|192.168.555.254:2500;192.168.555.255:2501' --job_name=worker --task_id=0 &
However, when I run it, I get the following error:
-bash: bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server: No such file or directory
The contents of my rpc folder are:
libgrpc_channel.pic.a libgrpc_remote_master.pic.lo libgrpc_session.pic.lo libgrpc_worker_service_impl.pic.a _objs/
libgrpc_master_service_impl.pic.a libgrpc_remote_worker.pic.a libgrpc_tensor_coding.pic.a libgrpc_worker_service.pic.a
libgrpc_master_service.pic.lo libgrpc_server_lib.pic.lo libgrpc_worker_cache.pic.a librpc_rendezvous_mgr.pic.a
I am clearly missing a step in between that is not mentioned in the blog. My objective is to be able to run the command above (to launch the gRPC server) so that I can start a worker process on one of the nodes.
The grpc_tensorflow_server binary was a temporary measure used in the pre-release version of Distributed TensorFlow, and it is no longer built by default or included in the binary distributions. Its replacement is the tf.train.Server Python class, which is more programmable and easier to use.
You can write simple Python scripts using tf.train.Server to reproduce the behavior of grpc_tensorflow_server:
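# ps.py. Run this on 192.168.0.1. (IP addresses changed to be valid.)
import tensorflow as tf
# The cluster definition is a single dict mapping each job name to its task addresses.
server = tf.train.Server({"ps": ["192.168.0.1:2222"],
                          "worker": ["192.168.0.2:2222", "192.168.0.3:2222"]},
                         job_name="ps", task_index=0)
server.join()  # Block and serve requests from the other tasks.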
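# worker_0.py. Run this on 192.168.0.2.
import tensorflow as tf
# Same cluster definition as ps.py; only job_name and task_index differ.
server = tf.train.Server({"ps": ["192.168.0.1:2222"],
                          "worker": ["192.168.0.2:2222", "192.168.0.3:2222"]},
                         job_name="worker", task_index=0)
server.join()  # Block and serve requests from the other tasks.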
# worker_1.py. Run this on 192.168.0.3. (IP addresses changed to be valid.)
import tensorflow as tf
# Same cluster definition as ps.py; only job_name and task_index differ.
server = tf.train.Server({"ps": ["192.168.0.1:2222"],
                          "worker": ["192.168.0.2:2222", "192.168.0.3:2222"]},
                         job_name="worker", task_index=1)
server.join()  # Block and serve requests from the other tasks.
Clearly this example could be cleaned up and made reusable with command-line flags etc., but TensorFlow doesn't prescribe a particular form for these. The main things to note are that (i) there is one tf.train.Server instance per TensorFlow task, (ii) all Server instances must be constructed with the same "cluster definition" (the dictionary mapping job names to lists of addresses), and (iii) each task is identified by a unique pair of job_name and task_index.
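As one possible way to make the three scripts reusable, here is a minimal sketch of a single launcher driven by command-line flags. The flag names (--ps_hosts, --worker_hosts, --job_name, --task_index) and the helper names build_cluster, parse_args and run are illustrative assumptions, not something TensorFlow prescribes. The sketch wraps the dictionary in tf.train.ClusterSpec before passing it to tf.train.Server, and imports TensorFlow only inside run so the flag handling can be checked on its own.

```python
import argparse


def build_cluster(ps_hosts, worker_hosts):
    """Return the cluster definition: a dict mapping job names to address lists."""
    return {
        "ps": [h for h in ps_hosts.split(",") if h],
        "worker": [h for h in worker_hosts.split(",") if h],
    }


def parse_args(argv):
    """Parse the (hypothetical) flags and check task_index against the cluster."""
    parser = argparse.ArgumentParser(description="Start one TensorFlow task.")
    parser.add_argument("--ps_hosts", required=True,
                        help="comma-separated host:port list for the ps job")
    parser.add_argument("--worker_hosts", required=True,
                        help="comma-separated host:port list for the worker job")
    parser.add_argument("--job_name", required=True, choices=["ps", "worker"])
    parser.add_argument("--task_index", type=int, required=True)
    args = parser.parse_args(argv)
    cluster = build_cluster(args.ps_hosts, args.worker_hosts)
    if not 0 <= args.task_index < len(cluster[args.job_name]):
        parser.error("task_index is out of range for job %r" % args.job_name)
    return args, cluster


def run(argv):
    """Start the server for this task and block, like the scripts above."""
    args, cluster = parse_args(argv)
    import tensorflow as tf  # imported here so the helpers work without TensorFlow
    server = tf.train.Server(tf.train.ClusterSpec(cluster),
                             job_name=args.job_name,
                             task_index=args.task_index)
    server.join()

# A real launcher script would end with: run(sys.argv[1:])
```

Each machine would then run the same file with identical --ps_hosts and --worker_hosts values and its own --job_name and --task_index, which keeps the cluster definition the same everywhere.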
Once you run the three scripts on their respective machines, you can create another script to connect to them:
import tensorflow as tf
sess = tf.Session("grpc://192.168.0.2:2222")
# ...
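To show how the session target and the per-task device names relate to the cluster definition, here is a small sketch. The helpers session_target and device_name are hypothetical names introduced for illustration; the "grpc://host:port" target and the "/job:<name>/task:<index>" device string follow the conventions used by the scripts above. The TensorFlow calls are left as comments because they need the three servers to be running.

```python
def session_target(cluster, job_name, task_index):
    """Return the "grpc://host:port" target for one task in the cluster."""
    return "grpc://" + cluster[job_name][task_index]


def device_name(job_name, task_index):
    """Return the device string that pins ops to one task, e.g. for tf.device()."""
    return "/job:%s/task:%d" % (job_name, task_index)


cluster = {"ps": ["192.168.0.1:2222"],
           "worker": ["192.168.0.2:2222", "192.168.0.3:2222"]}
target = session_target(cluster, "worker", 0)
ps_device = device_name("ps", 0)

# With TensorFlow installed and the three servers running, these could be used as:
#   with tf.device(ps_device):
#       weights = tf.Variable(tf.zeros([10]))
#   sess = tf.Session(target)
```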