Distributed TensorFlow: CreateSession still waiting


Problem description


The simple script below is launched with the arguments shown in its header. Its behaviour varies from run to run, but often one of the workers hangs and prints these "CreateSession still waiting for some other task" messages. Why does a new MonitoredTrainingSession need the others? And why don't the others wait for it to start?

# #!/bin/bash
# python train.py --job master --task 0 &
# python train.py --job worker --task 0 &
# python train.py --job worker --task 1 &
# python train.py --job worker --task 2 &
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--job', type=str)
parser.add_argument('--task', type=int)
args = parser.parse_args()

# One master task and three worker tasks, all on localhost.
hosts = {
    "master": [
        "localhost:2222",
    ],
    "worker": [
        "localhost:2223",
        "localhost:2224",
        "localhost:2225",
    ]
}

nworkers = len(hosts['worker'])
cluster = tf.train.ClusterSpec(hosts)
# Start the in-process server for this job/task pair.
server = tf.train.Server(cluster, job_name=args.job, task_index=args.task)

# Place the shared global step on the master.
with tf.device(f'/job:master/task:0'):
    global_step = tf.train.get_or_create_global_step()
    inc_global_step = tf.assign(global_step, global_step + 1)

if args.job == 'worker':
    hooks = [
        tf.train.StopAtStepHook(last_step=4),
    ]
    # Each worker opens a session against its own server; task 0 is the chief.
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(args.task == 0),
                                           hooks=hooks) as sess:
        while not sess.should_stop():
            print(args.task, sess.run(inc_global_step))
else:
    server.join()

It would make sense for a worker to wait for the chief to initialize its variables, but it happens to wait for other non-chief workers too. So, does MonitoredTrainingSession synchronise tasks? If it doesn't, are FIFOQueues the only primitive for manual synchronisation?

Solution

By default, a distributed TensorFlow session will attempt to connect to all servers named in the tf.train.ClusterSpec, and will block until they respond. This provides a useful barrier that ensures that all workers have become ready to receive computation requests before returning control to the user. This barrier happens before the MonitoredTrainingSession code that waits for the chief to initialize variables.
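To make the barrier concrete for the script in the question, here is a minimal sketch (not part of the original answer, and assuming the same cluster and in-process server as above): even a session that runs nothing but a no-op has to reach every task named in the ClusterSpec first.

# Minimal sketch, assuming the cluster and `server` from the script above.
# The CreateSession request issued on the first run contacts every task in the
# ClusterSpec, so this blocks until the master and all three worker servers
# are running -- independently of any variable initialization.
with tf.Session(server.target) as sess:
    sess.run(tf.no_op())  # returns only once every task has responded
    print('all tasks in the cluster responded')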

If you don't want your session to wait on all servers (e.g. just wait on tasks in "/job:ps" and not the other tasks in "/job:worker", which is a common between-graph deployment strategy), the easiest option is to specify a "device filter" when you create your session. The device filter is a whitelist of (partial) device specifications that determines which tasks a tf.Session will contact at startup. For example, the mnist_replica.py test specifies a device filter as part of the tf.ConfigProto that is used to configure the session.
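Applied to the script in the question, this could look roughly like the following. This is a hedged sketch rather than the mnist_replica.py code: since the shared global_step lives on /job:master/task:0, each worker only needs to whitelist the master and its own task.

# Sketch under the assumption that all shared state lives on /job:master/task:0,
# as in the script above. Each worker then contacts only the master and itself.
config = tf.ConfigProto(device_filters=[
    '/job:master',                    # holds the shared global_step
    f'/job:worker/task:{args.task}',  # this worker's own server
])
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(args.task == 0),
                                       hooks=hooks,
                                       config=config) as sess:
    while not sess.should_stop():
        print(args.task, sess.run(inc_global_step))

With the filter in place, each worker can start as soon as the master is up, regardless of whether the other workers have been launched yet.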
