Shut down server in TensorFlow


Problem Description

When we want to use distributed TensorFlow, we will create a parameter server using

tf.train.Server.join()
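
For example, a typical parameter server process looks like this (a minimal sketch using the TF 1.x API; the cluster addresses are placeholders):

import tensorflow as tf

# Placeholder cluster addresses for illustration.
cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"],
                                "worker": ["localhost:2223"]})
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()  # blocks forever; the process has to be killed externally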

However, I can't find any way to shut down the server except killing the process. The TensorFlow documentation for join() is:

Blocks until the server has shut down.
This method currently blocks forever.

This is quite bothersome to me because I would like to create many servers for computation and shut them down when everything finishes.

Is there a possible solution for this?

Thanks.

Recommended Answer

You can have parameter server processes die on demand by using session.run(dequeue_op) instead of server.join() and having another process enqueue something onto that queue when you want this process to die.
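
In other words, on the server side the blocking call changes from server.join() to a blocking dequeue. A minimal sketch, assuming the TF 1.x API (the queue name "shutdown_queue" and the single-worker cluster spec are arbitrary choices for illustration):

import tensorflow as tf

# Assumed single-worker cluster spec for illustration.
cluster = tf.train.ClusterSpec({"worker": ["127.0.0.1:12222"]})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# A blocking dequeue replaces server.join(); shared_name lets another
# process reach this queue and enqueue a shutdown token.
queue = tf.FIFOQueue(1, tf.int32, shared_name="shutdown_queue")
with tf.Session(server.target) as sess:
    sess.run(queue.dequeue())  # returns once a token is enqueued
# Falling off the end of the script terminates the server process.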

So for k parameter server shards, you could create k queues, each with a unique shared_name property, and have each server try to dequeue from its own queue. When you want to bring down the servers, you loop over all the queues and enqueue a token onto each one. This causes session.run to unblock; the Python process then runs to the end and quits, bringing down the server.
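
For example, the client-side shutdown for k shards could be a simple loop (a sketch, assuming the servers block on queues named queue0, queue1, ... as in the full example below):

import tensorflow as tf

k = 2  # number of parameter server shards
# Handles to the servers' shutdown queues; shared_name must match
# the names used on the server side (queue0, queue1, ...).
queues = [tf.FIFOQueue(1, tf.int32, shared_name="queue%d" % i)
          for i in range(k)]
with tf.Session("grpc://127.0.0.1:12222") as sess:
    for q in queues:
        sess.run(q.enqueue(1))  # each enqueue unblocks one server's dequeue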

Below is a self-contained example with 2 shards taken from: https://gist.github.com/yaroslavvb/82a5b5302449530ca5ff59df520c369e

(For a multi-worker/multi-shard example, see https://gist.github.com/yaroslavvb/ea1b1bae0a75c4aae593df7eca72d9ca)

import subprocess
import tensorflow as tf
import time
import sys

flags = tf.flags
flags.DEFINE_string("port1", "12222", "port of worker1")
flags.DEFINE_string("port2", "12223", "port of worker2")
flags.DEFINE_string("task", "", "internal use")
FLAGS = flags.FLAGS

# setup local cluster from flags
host = "127.0.0.1:"
cluster = {"worker": [host+FLAGS.port1, host+FLAGS.port2]}
clusterspec = tf.train.ClusterSpec(cluster).as_cluster_def()

if __name__=='__main__':
  if not FLAGS.task:  # start servers and run client

      # launch distributed service
      def runcmd(cmd): subprocess.Popen(cmd, shell=True, stderr=subprocess.STDOUT)
      runcmd("python %s --task=0"%(sys.argv[0]))
      runcmd("python %s --task=1"%(sys.argv[0]))
      time.sleep(1)

      # bring down distributed service
      sess = tf.Session("grpc://"+host+FLAGS.port1)
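      # Client-side handles to the servers' shutdown queues;
      # shared_name ties each handle to the matching server-side queue.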
      queue0 = tf.FIFOQueue(1, tf.int32, shared_name="queue0")
      queue1 = tf.FIFOQueue(1, tf.int32, shared_name="queue1")
      with tf.device("/job:worker/task:0"):
          add_op0 = tf.add(tf.ones(()), tf.ones(()))
      with tf.device("/job:worker/task:1"):
          add_op1 = tf.add(tf.ones(()), tf.ones(()))

      print("Running computation on server 0")
      print(sess.run(add_op0))
      print("Running computation on server 1")
      print(sess.run(add_op1))

      print("Bringing down server 0")
      sess.run(queue0.enqueue(1))
      print("Bringing down server 1")
      sess.run(queue1.enqueue(1))

  else: # Launch TensorFlow server
    server = tf.train.Server(clusterspec, config=None,
                             job_name="worker",
                             task_index=int(FLAGS.task))
    print("Starting server "+FLAGS.task)
    sess = tf.Session(server.target)
    queue = tf.FIFOQueue(1, tf.int32, shared_name="queue"+FLAGS.task)
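    # Block until the client enqueues a shutdown token, then fall through and exit.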
    sess.run(queue.dequeue())
    print("Terminating server"+FLAGS.task)
