TensorFlow: minimalist program fails on distributed mode


Problem description

I wrote a very simple program that runs just fine without distribution but hangs on CheckpointSaverHook in distributed mode (everything on my localhost, though!). I've seen there have been a few questions about hanging in distributed mode, but none seems to match mine.

Here's the script (written to toy with the new layers API):

import numpy as np
import tensorflow as tf
from tensorflow.contrib.learn.python.learn import learn_runner
from tensorflow.contrib import layers

DATA_SIZE = 10
DIMENSION = 5
FEATURES = 'features'

def generate_input_fn():
    def _input_fn():
        mid = int(DATA_SIZE / 2)

        # First half of the examples are all ones, second half all minus ones.
        data = np.array([np.ones(DIMENSION) if x < mid else -np.ones(DIMENSION) for x in range(DATA_SIZE)])
        labels = ['0' if x < mid else '1' for x in range(DATA_SIZE)]

        # Map the string labels '0'/'1' to integer ids.
        table = tf.contrib.lookup.string_to_index_table_from_tensor(tf.constant(['0', '1']))
        label_tensor = table.lookup(tf.convert_to_tensor(labels, dtype=tf.string))

        return {FEATURES: tf.convert_to_tensor(data, dtype=tf.float32)}, label_tensor
    return _input_fn

def build_estimator(model_dir):
    features = layers.real_valued_column(FEATURES, dimension=DIMENSION)
    return tf.contrib.learn.DNNLinearCombinedClassifier(
        model_dir=model_dir,
        dnn_feature_columns=[features],
        dnn_hidden_units=[20, 20])

def generate_exp_fun():
    def _exp_fun(output_dir):
        return tf.contrib.learn.Experiment(
            build_estimator(output_dir),
            train_input_fn=generate_input_fn(),
            eval_input_fn=generate_input_fn(),
            train_steps=100
        )
    return _exp_fun

if __name__ == '__main__':
    tf.logging.set_verbosity(tf.logging.DEBUG)
    learn_runner.run(generate_exp_fun(), 'job_dir')

To test distributed mode, I simply launch it with the environment variable TF_CONFIG={"cluster": {"ps":["localhost:5040"], "worker":["localhost:5041"]}, "task":{"type":"worker","index":0}, "environment": "local"} (this is for the worker; the same, with the ps type, is used to launch the parameter server).
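For concreteness, here is a minimal sketch of launching both processes from Python; the cluster addresses come from the question, while the script name script.py is a placeholder for this file:

import json
import os
import subprocess

cluster = {"ps": ["localhost:5040"], "worker": ["localhost:5041"]}
for job in ("ps", "worker"):
    env = dict(os.environ)
    env["TF_CONFIG"] = json.dumps({
        "cluster": cluster,
        "task": {"type": job, "index": 0},
        "environment": "local",
    })
    # Each process reads TF_CONFIG at startup to learn its role in the cluster.
    subprocess.Popen(["python", "script.py"], env=env)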

I use tensorflow-1.0.1 (but had the same behavior with 1.0.0) on windows-64, CPU only. I never actually get any error; it just hangs forever after INFO:tensorflow:Create CheckpointSaverHook.. I've tried attaching the Visual Studio C++ debugger to the process, but with little success so far, so I can't print a stack trace of what's happening in the native part.

P.S.: it's not a problem with DNNLinearCombinedClassifier, because it fails just as well with a simple tf.contrib.learn.LinearClassifier. And as noted in the comments, it's not due to both processes running on localhost, since it also fails when running on separate VMs.
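For reference, a minimal sketch of that substitution (same feature column; the rest of the script is assumed unchanged):

def build_estimator(model_dir):
    # Hangs the same way as DNNLinearCombinedClassifier in distributed mode.
    features = layers.real_valued_column(FEATURES, dimension=DIMENSION)
    return tf.contrib.learn.LinearClassifier(
        model_dir=model_dir,
        feature_columns=[features])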

I think there's actually an issue with server launching. It looks like the server is not launched when you're in local mode (whether distributed or not); cf. tensorflow/contrib/learn/python/learn/experiment.py, ll. 250-258:

# Start the server, if needed. It's important to start the server before
# we (optionally) sleep for the case where no device_filters are set.
# Otherwise, the servers will wait to connect to each other before starting
# to train. We might as well start as soon as we can.
config = self._estimator.config
if (config.environment != run_config.Environment.LOCAL and
    config.environment != run_config.Environment.GOOGLE and
    config.cluster_spec and config.master):
  self._start_server()

This prevents the server from being started in local mode for the workers... Does anyone know whether this is a bug, or am I missing something?
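One way to test this hypothesis is to start the server by hand before calling learn_runner.run(), which is roughly what _start_server() would otherwise do. A sketch, assuming the same cluster spec as above; this is not a confirmed fix:

cluster = tf.train.ClusterSpec({"ps": ["localhost:5040"],
                                "worker": ["localhost:5041"]})
# Start this task's in-process server explicitly so the tasks can connect.
server = tf.train.Server(cluster, job_name="worker", task_index=0)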

Answer

This has been answered in https://github.com/tensorflow/tensorflow/issues/8796: in the end, one should use the CLOUD environment for any distributed operation.
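In practice that means setting the environment field of TF_CONFIG to "cloud" instead of "local", e.g. (a sketch based on the launch configuration above):

import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"ps": ["localhost:5040"], "worker": ["localhost:5041"]},
    "task": {"type": "worker", "index": 0},
    # "cloud" (run_config.Environment.CLOUD) lets each task start its server.
    "environment": "cloud",
})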
