Number of steps doesn't match when using tf.estimator.Estimator


Problem description

I am figuring out the TensorFlow estimator framework. I finally have code for a model that trains, using a simple MNIST autoencoder for my tests. I have two questions. First, why is the number of steps reported by training different from the number of steps I specify in the estimator's train() method? Second, how do I use training hooks to do things like periodic evaluation or printing the loss every X steps? The docs seem to say to use training hooks, but I cannot find any actual examples of how to use them.

Here is my code:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import time
import shutil
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

from IPython import display
from tensorflow.examples.tutorials.mnist import input_data

data = input_data.read_data_sets('.')
display.clear_output()

def _model_fn(features, labels, mode=None, params=None):
    # define inputs
    image = tf.feature_column.numeric_column('images', shape=(784, ))
    inputs = tf.feature_column.input_layer(features, [image, ])
    # encoder
    e1 = tf.layers.dense(inputs, 512, activation=tf.nn.relu)
    e2 = tf.layers.dense(e1, 256, activation=tf.nn.relu)
    # decoder
    d1 = tf.layers.dense(e2, 512, activation=tf.nn.relu)
    model = tf.layers.dense(d1, 784, activation=tf.nn.relu)
    # training ops
    loss = tf.losses.mean_squared_error(labels, model)
    train = tf.train.AdamOptimizer().minimize(loss, global_step=tf.train.get_global_step())
    if mode == tf.estimator.ModeKeys.TRAIN:
        return tf.estimator.EstimatorSpec(mode=mode,
                                          loss=loss,
                                          train_op=train)

_train_input_fn = tf.estimator.inputs.numpy_input_fn({'images': data.train.images},
                                                     y=np.array(data.train.images),
                                                     batch_size=100,
                                                     shuffle=True)

shutil.rmtree("logs", ignore_errors=True)
tf.logging.set_verbosity(tf.logging.INFO)
estimator = tf.estimator.Estimator(_model_fn, 
                                   model_dir="logs", 
                                   config=tf.contrib.learn.RunConfig(save_checkpoints_steps=1000),
                                   params={})
estimator.train(_train_input_fn, steps=1000)

And here is the output I get (notice how training stops at step 550 even though the code explicitly asks for 1000):

INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x12b9fa630>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': None, '_session_config': None, '_save_checkpoints_steps': 1000, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': 'logs'}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into logs/model.ckpt.
INFO:tensorflow:loss = 0.102862, step = 1
INFO:tensorflow:global_step/sec: 41.8119
INFO:tensorflow:loss = 0.0191228, step = 101 (2.393 sec)
INFO:tensorflow:global_step/sec: 39.9923
INFO:tensorflow:loss = 0.0141014, step = 201 (2.500 sec)
INFO:tensorflow:global_step/sec: 40.9806
INFO:tensorflow:loss = 0.0116138, step = 301 (2.440 sec)
INFO:tensorflow:global_step/sec: 40.0043
INFO:tensorflow:loss = 0.00998991, step = 401 (2.500 sec)
INFO:tensorflow:global_step/sec: 39.2571
INFO:tensorflow:loss = 0.0124132, step = 501 (2.548 sec)
INFO:tensorflow:Saving checkpoints for 550 into logs/model.ckpt.
INFO:tensorflow:Loss for final step: 0.00940801.

<tensorflow.python.estimator.estimator.Estimator at 0x12b9fa780>

Update #1: I found the answer to the first question. Training stopped at step 550 because numpy_input_fn() defaults to num_epochs=1, so the input ran out after a single pass over the data. I am still looking for help with training hooks though.
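The arithmetic behind the 550 can be sketched in plain Python. This is a minimal illustration, not TensorFlow code: the helper names are hypothetical, the figure of 55,000 training images is an assumption inferred from the log (550 steps at batch_size=100), and the ceiling assumes a final partial batch would still be emitted.

```python
import math

def steps_before_input_runs_out(num_examples, batch_size, num_epochs):
    """Number of batches an input function can yield before it is exhausted.

    Assumes a final partial batch is still emitted, hence the ceiling.
    """
    return math.ceil(num_examples * num_epochs / batch_size)

def effective_train_steps(requested_steps, num_examples, batch_size, num_epochs):
    """Training stops at whichever limit is reached first."""
    if num_epochs is None:  # input repeats indefinitely, so only `steps` limits training
        return requested_steps
    return min(requested_steps,
               steps_before_input_runs_out(num_examples, batch_size, num_epochs))

# Assumed: 55,000 training images, batch_size=100, steps=1000 requested.
print(effective_train_steps(1000, 55000, 100, 1))     # 550
print(effective_train_steps(1000, 55000, 100, None))  # 1000
```

Under these assumptions, passing num_epochs=None to numpy_input_fn() should let train(steps=1000) run for the full 1000 steps, since the input then repeats instead of ending after one epoch.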

Recommended answer

The estimator can be run in 3 modes:

  1. Train
  2. Evaluate
  3. Predict

Your current code is only configured to run in training mode. If you want to include an evaluation step, you first have to make some changes to the model function:

def _model_fn(features, labels, mode=None, params=None):
    # define inputs
    image = tf.feature_column.numeric_column('images', shape=(784, ))
    inputs = tf.feature_column.input_layer(features, [image, ])
    # encoder
    e1 = tf.layers.dense(inputs, 512, activation=tf.nn.relu)
    e2 = tf.layers.dense(e1, 256, activation=tf.nn.relu)
    # decoder
    d1 = tf.layers.dense(e2, 512, activation=tf.nn.relu)
    model = tf.layers.dense(d1, 784, activation=tf.nn.relu)
    # training ops
    loss = tf.losses.mean_squared_error(labels, model)
    train = tf.train.AdamOptimizer().minimize(loss, global_step=tf.train.get_global_step())
    if mode == tf.estimator.ModeKeys.TRAIN:
        return tf.estimator.EstimatorSpec(mode=mode,
                                          loss=loss,
                                          train_op=train)

    # evaluation metrics; tf.metrics.precision/recall cast labels and predictions
    # to bool, so on reconstructed pixels they only show how metrics are wired in
    prec, prec_update_op = tf.metrics.precision(labels=labels, predictions=model, name='precision_op')
    recall, recall_update_op = tf.metrics.recall(labels=labels, predictions=model, name='recall_op')

    metrics = {'recall': (recall, recall_update_op),
               'precision': (prec, prec_update_op)}

    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)

Now, to run evaluation periodically and print the loss every 10 steps:

configuration = tf.estimator.RunConfig(
    model_dir='logs',
    keep_checkpoint_max=5,
    save_checkpoints_steps=1500,
    log_step_count_steps=10)  # log the loss and global_step/sec every 10 steps

estimator = tf.estimator.Estimator(model_fn=_model_fn, params={}, config=configuration)

# train until the global step reaches 5000 (TrainSpec's argument is max_steps)
train_spec = tf.estimator.TrainSpec(input_fn=_train_input_fn, max_steps=5000)
# each evaluation runs for 100 steps, at most once every 600 seconds
eval_spec = tf.estimator.EvalSpec(input_fn=_train_input_fn, steps=100, throttle_secs=600)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Notes:

  1. After every new checkpoint is saved (i.e. every 1500 steps), evaluation runs for 100 steps and then training resumes.
  2. log_step_count_steps sets how often, in steps, the loss is logged; here that is every 10 steps.
  3. throttle_secs defines the minimum number of seconds between two consecutive evaluations. If a new checkpoint is saved before that many seconds have passed, evaluation is skipped for that checkpoint.
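The interaction between checkpoints and throttle_secs described in the notes can be modelled with a few lines of plain Python. This is a simplified sketch of the rule as stated above, not TensorFlow's actual scheduling: the function name and the checkpoint timestamps are hypothetical, and other settings such as an initial evaluation delay are ignored.

```python
def evaluation_times(checkpoint_times, throttle_secs):
    """Return the checkpoint times at which an evaluation would run.

    Simplified model: a checkpoint triggers an evaluation only if at least
    throttle_secs have passed since the previous evaluation.
    """
    evaluated = []
    last_eval = None
    for t in checkpoint_times:
        if last_eval is None or t - last_eval >= throttle_secs:
            evaluated.append(t)
            last_eval = t
        # otherwise the checkpoint is saved but no evaluation is run for it
    return evaluated

# Hypothetical checkpoint timestamps, in seconds since training started.
checkpoints = [40, 300, 650, 700, 1400]
print(evaluation_times(checkpoints, throttle_secs=600))  # [40, 650, 1400]
```

In this model, a fast training loop that saves checkpoints more often than every throttle_secs seconds is evaluated less often than it checkpoints, so lowering throttle_secs is the knob to turn if evaluations seem to be missing.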

The above trains and evaluates on the same dataset. To evaluate on a different dataset, pass an input function for that dataset as the input_fn argument of tf.estimator.EvalSpec.
