Distributed Tensorflow Estimator execution does not trigger evaluation or export

Problem description

I am testing distributed training with TensorFlow Estimators. In my example I fit a simple sine function with a custom estimator using tf.estimator.train_and_evaluate. After training and evaluation I want to export the model so that it is ready for TensorFlow Serving. However, the evaluation and export are only triggered when the estimator is executed in a non-distributed way.

The model and Estimator are defined as follows:

import math
import random

import tensorflow as tf


def my_model(features, labels, mode):
    # define simple dense network
    net = tf.layers.dense(features['x'], units=8, activation=tf.nn.tanh)
    net = tf.layers.dense(net, units=8, activation=tf.nn.tanh)
    net = tf.layers.dense(net, units=8, activation=tf.nn.tanh)
    net = tf.layers.dense(net, units=8, activation=tf.nn.tanh)
    net = tf.layers.dense(net, units=8, activation=tf.nn.tanh)
    net = tf.layers.dense(net, units=8, activation=tf.nn.tanh)
    net = tf.layers.dense(net, units=8, activation=tf.nn.tanh)
    net = tf.layers.dense(net, units=8, activation=tf.nn.tanh)

    # output layer
    predictions = tf.layers.dense(net, units=1, activation=tf.nn.tanh)

    if mode == tf.estimator.ModeKeys.PREDICT:
        # define output message for tensorflow serving
        export_outputs = {'predict_output': tf.estimator.export.PredictOutput({"predictions": predictions})}

        return tf.estimator.EstimatorSpec(mode=mode, predictions={'predictions': predictions}, export_outputs=export_outputs)
    elif mode == tf.estimator.ModeKeys.EVAL:
        # for evaluation simply use mean squared error
        loss = tf.losses.mean_squared_error(labels=labels, predictions=predictions)
        metrics = {'mse': tf.metrics.mean_squared_error(labels, predictions)}

        return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)
    elif mode == tf.estimator.ModeKeys.TRAIN:
        # train on mse with Adagrad optimizer
        loss = tf.losses.mean_squared_error(labels=labels, predictions=predictions)
        optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    else:
        raise ValueError("unhandled mode: %s" % str(mode))


def main(_):
    # prepare training data
    default_batch_size = 50
    examples = [{'x': x, 'y': math.sin(x)} for x in [random.random()*2*math.pi for _ in range(10000)]]

    estimator = tf.estimator.Estimator(model_fn=my_model,
                                       config=tf.estimator.RunConfig(model_dir='sin_model',
                                                                     save_summary_steps=100))

    # function converting examples to dataset
    def dataset_fn():
        # returns a dataset serving batched (feature_map, label)-pairs
        # e.g. ({'x': [1.0, 0.3, 1.1...]}, [0.84, 0.29, 0.89...])
        return tf.data.Dataset.from_generator(
            lambda: iter(examples),
            output_types={"x": tf.float32, "y": tf.float32},
            output_shapes={"x": [], "y": []}) \
            .map(lambda x: ({'x': [x['x']]}, [x['y']])) \
            .repeat() \
            .batch(default_batch_size)

    # function to export model to be used for serving
    feature_spec = {'x': tf.FixedLenFeature([1], tf.float32)}
    def serving_input_fn():
        serialized_tf_example = tf.placeholder(dtype=tf.string, shape=[default_batch_size])

        receiver_tensors = {'examples': serialized_tf_example}
        features = tf.parse_example(serialized_tf_example, feature_spec)
        return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)

    # train, evaluate and export
    train_spec = tf.estimator.TrainSpec(input_fn=dataset_fn, max_steps=1000)
    eval_spec = tf.estimator.EvalSpec(input_fn=dataset_fn,
                                      steps=100,
                                      exporters=[tf.estimator.FinalExporter('sin', serving_input_fn)])

    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

if __name__ == '__main__':
    tf.app.run(main)

When executing this code in a single process, I receive an output folder that contains model checkpoints, evaluation data and the model export:

$ ls sin_model/
checkpoint                                  model.ckpt-0.index
eval                                        model.ckpt-0.meta
events.out.tfevents.1532426226.simon        model.ckpt-1000.data-00000-of-00001
export                                      model.ckpt-1000.index
graph.pbtxt                                 model.ckpt-1000.meta
model.ckpt-0.data-00000-of-00001

However, when distributing the training process (in this test setup only on the local machine), the eval and export folders are missing.

I start the individual nodes using the following cluster config:

{"cluster": {
    "ps": ["localhost:2222"],
    "chief": ["localhost:2223"],
    "worker": ["localhost:2224"]
}}
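
tf.estimator.RunConfig parses the TF_CONFIG environment variable when it is constructed, so each process resolves its role from the JSON above. As a sanity check (not part of the original question), a minimal sketch that prints what a given process resolved could look like this, assuming TF 1.x behaviour:

import tensorflow as tf

# RunConfig reads the cluster and task description from TF_CONFIG.
run_config = tf.estimator.RunConfig(model_dir='sin_model')
print("task_type:", run_config.task_type)  # e.g. 'chief', 'worker' or 'ps'
print("task_id:", run_config.task_id)
cluster = run_config.cluster_spec
print("cluster_spec:", cluster.as_dict() if cluster else None)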

The ps server is started as follows:

$ TF_CONFIG='{"cluster": {"chief": ["localhost:2223"], "worker": ["localhost:2224"], "ps": ["localhost:2222"]}, "task": {"type": "ps", "index": 0}}' CUDA_VISIBLE_DEVICES= python custom_estimator.py
2018-07-24 12:09:04.913967: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-24 12:09:04.914008: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:132] retrieving CUDA diagnostic information for host: simon
2018-07-24 12:09:04.914013: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:139] hostname: simon
2018-07-24 12:09:04.914035: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] libcuda reported version is: 384.130.0
2018-07-24 12:09:04.914059: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:167] kernel reported version is: 384.130.0
2018-07-24 12:09:04.914079: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:249] kernel version seems to match DSO: 384.130.0
2018-07-24 12:09:04.914961: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> localhost:2223}
2018-07-24 12:09:04.914971: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-24 12:09:04.914976: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2224}
2018-07-24 12:09:04.915658: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:369] Started server with target: grpc://localhost:2222

(I appended CUDA_VISIBLE_DEVICES= to the command line to prevent the worker and chief from allocating GPU memory. This causes a failed call to cuInit: CUDA_ERROR_NO_DEVICE error, which is however not critical.)

The chief is then started as follows:

$ TF_CONFIG='{"cluster": {"chief": ["localhost:2223"], "worker": ["localhost:2224"], "ps": ["localhost:2222"]}, "task": {"type": "chief", "index": 0}}' CUDA_VISIBLE_DEVICES= python custom_estimator.py
2018-07-24 12:09:10.532171: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-24 12:09:10.532234: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:132] retrieving CUDA diagnostic information for host: simon
2018-07-24 12:09:10.532241: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:139] hostname: simon
2018-07-24 12:09:10.532298: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] libcuda reported version is: 384.130.0
2018-07-24 12:09:10.532353: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:167] kernel reported version is: 384.130.0
2018-07-24 12:09:10.532359: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:249] kernel version seems to match DSO: 384.130.0
2018-07-24 12:09:10.533195: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> localhost:2223}
2018-07-24 12:09:10.533207: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-24 12:09:10.533211: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2224}
2018-07-24 12:09:10.533835: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:369] Started server with target: grpc://localhost:2223
2018-07-24 12:09:14.038636: I tensorflow/core/distributed_runtime/master_session.cc:1165] Start master session 71a2748ad69725ae with config: allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } }

And the worker is then started as follows:

$ TF_CONFIG='{"cluster": {"chief": ["localhost:2223"], "worker": ["localhost:2224"], "ps": ["localhost:2222"]}, "task": {"type": "worker", "index": 0}}' CUDA_VISIBLE_DEVICES= python custom_estimator.py
2018-07-24 12:09:13.172260: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-24 12:09:13.172320: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:132] retrieving CUDA diagnostic information for host: simon
2018-07-24 12:09:13.172327: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:139] hostname: simon
2018-07-24 12:09:13.172362: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] libcuda reported version is: 384.130.0
2018-07-24 12:09:13.172399: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:167] kernel reported version is: 384.130.0
2018-07-24 12:09:13.172405: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:249] kernel version seems to match DSO: 384.130.0
2018-07-24 12:09:13.173230: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> localhost:2223}
2018-07-24 12:09:13.173242: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-24 12:09:13.173247: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2224}
2018-07-24 12:09:13.173783: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:369] Started server with target: grpc://localhost:2224
2018-07-24 12:09:18.774264: I tensorflow/core/distributed_runtime/master_session.cc:1165] Start master session 1d13ac84816fdc80 with config: allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } }

After a short time the chief process stops, and the sin_model folder contains the model checkpoints but no export or evaluation data:

$ ls sin_model/
checkpoint                                  model.ckpt-0.meta
events.out.tfevents.1532426950.simon        model.ckpt-1001.data-00000-of-00001
graph.pbtxt                                 model.ckpt-1001.index
model.ckpt-0.data-00000-of-00001            model.ckpt-1001.meta
model.ckpt-0.index

Is there some additional configuration required for the distributed setup?

I am working with Python 3.5 and TensorFlow 1.8.

Answer

In distributed mode, you can run evaluation in parallel with training by setting the task type to evaluator:

{
   "cluster": {
     "ps": ["localhost:2222"],
     "chief": ["localhost:2223"], 
     "worker": ["localhost:2224"]
   },
   "task": {
     "type": "evaluator", "index": 0
   },
   "environment": "cloud"
}

You don't need to define the evaluator within your cluster definition. Also, not sure if this is related to your case, but setting environment: 'cloud' in your cluster config might help.
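
One way to launch such an evaluator process, following the same pattern as the other launch commands in the question (a sketch, not taken from the original answer; the evaluator runs the same custom_estimator.py script, and train_and_evaluate should route that process to continuous evaluation and the final export):

$ TF_CONFIG='{"cluster": {"chief": ["localhost:2223"], "worker": ["localhost:2224"], "ps": ["localhost:2222"]}, "task": {"type": "evaluator", "index": 0}, "environment": "cloud"}' CUDA_VISIBLE_DEVICES= python custom_estimator.py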
