Number of steps doesn't match when using tf.estimator.Estimator

Views: 54 Published: 2021/9/5 19:51:03 tensorflow

Question

I am figuring out the TensorFlow estimator framework, and I finally have code for a model that trains. I am using a simple MNIST autoencoder for my tests. I have two questions. First, why is the number of steps reported by training different from the number of steps I specify in the estimator's train() method? Second, how do I use training hooks to do things like periodic evaluation, loss output every X steps, and so on? The docs seem to say to use training hooks, but I cannot find any actual examples of how to use them.

Here is my code:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import time
import shutil
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

from IPython import display
from tensorflow.examples.tutorials.mnist import input_data

data = input_data.read_data_sets('.')
display.clear_output()

def _model_fn(features, labels, mode=None, params=None):
    # define inputs
    image = tf.feature_column.numeric_column('images', shape=(784, ))
    inputs = tf.feature_column.input_layer(features, [image, ])
    # encoder
    e1 = tf.layers.dense(inputs, 512, activation=tf.nn.relu)
    e2 = tf.layers.dense(e1, 256, activation=tf.nn.relu)
    # decoder
    d1 = tf.layers.dense(e2, 512, activation=tf.nn.relu)
    model = tf.layers.dense(d1, 784, activation=tf.nn.relu)
    # training ops
    loss = tf.losses.mean_squared_error(labels, model)
    train = tf.train.AdamOptimizer().minimize(loss, global_step=tf.train.get_global_step())
    if mode == tf.estimator.ModeKeys.TRAIN:
        return tf.estimator.EstimatorSpec(mode=mode,
                                          loss=loss,
                                          train_op=train)

_train_input_fn = tf.estimator.inputs.numpy_input_fn({'images': data.train.images},
                                                     y=np.array(data.train.images),
                                                     batch_size=100,
                                                     shuffle=True)

shutil.rmtree("logs", ignore_errors=True)
tf.logging.set_verbosity(tf.logging.INFO)
estimator = tf.estimator.Estimator(_model_fn, 
                                   model_dir="logs", 
                                   config=tf.contrib.learn.RunConfig(save_checkpoints_steps=1000),
                                   params={})
estimator.train(_train_input_fn, steps=1000)

And here is the output I get (notice how training stops at step 550 even though the code explicitly asks for 1000):

INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x12b9fa630>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': None, '_session_config': None, '_save_checkpoints_steps': 1000, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': 'logs'}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into logs/model.ckpt.
INFO:tensorflow:loss = 0.102862, step = 1
INFO:tensorflow:global_step/sec: 41.8119
INFO:tensorflow:loss = 0.0191228, step = 101 (2.393 sec)
INFO:tensorflow:global_step/sec: 39.9923
INFO:tensorflow:loss = 0.0141014, step = 201 (2.500 sec)
INFO:tensorflow:global_step/sec: 40.9806
INFO:tensorflow:loss = 0.0116138, step = 301 (2.440 sec)
INFO:tensorflow:global_step/sec: 40.0043
INFO:tensorflow:loss = 0.00998991, step = 401 (2.500 sec)
INFO:tensorflow:global_step/sec: 39.2571
INFO:tensorflow:loss = 0.0124132, step = 501 (2.548 sec)
INFO:tensorflow:Saving checkpoints for 550 into logs/model.ckpt.
INFO:tensorflow:Loss for final step: 0.00940801.

<tensorflow.python.estimator.estimator.Estimator at 0x12b9fa780>

Update #1: I found the answer to the first question. Training stopped at step 550 because numpy_input_fn() defaults to num_epochs=1, so the input ran out after one pass over the data. I am still looking for help with training hooks, though.
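
The arithmetic behind this update can be sketched as follows. This is a minimal, TensorFlow-free illustration, not the library's implementation: the helper name steps_actually_run is hypothetical, and the figure of 55,000 training images assumes the default train/validation split that read_data_sets() produces for MNIST.

```python
import math


def steps_actually_run(num_examples, batch_size, num_epochs, requested_steps):
    """Estimate how many training steps run before the input is exhausted.

    Hypothetical helper for illustration only. num_epochs=None means the
    input function repeats forever, so only requested_steps limits training.
    """
    if num_epochs is None:
        return requested_steps
    # Number of batches the input function can produce in total.
    available_batches = math.ceil(num_examples * num_epochs / batch_size)
    return min(requested_steps, available_batches)


# Assumed: 55,000 training images (default MNIST split) and batch_size=100.
one_epoch = steps_actually_run(55000, 100, num_epochs=1, requested_steps=1000)
forever = steps_actually_run(55000, 100, num_epochs=None, requested_steps=1000)
print(one_epoch)  # 550
print(forever)    # 1000
```

Under these assumptions, passing num_epochs=None to numpy_input_fn() lets train(steps=1000) run the full 1000 steps.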

Recommended answer

The estimator can be run in 3 modes:

training
evaluation
prediction

Your current code is only configured to run in training mode. If you want to include an evaluation step, you first have to make some changes to the model function:

def _model_fn(features, labels, mode=None, params=None):
    # define inputs
    image = tf.feature_column.numeric_column('images', shape=(784, ))
    inputs = tf.feature_column.input_layer(features, [image, ])
    # encoder
    e1 = tf.layers.dense(inputs, 512, activation=tf.nn.relu)
    e2 = tf.layers.dense(e1, 256, activation=tf.nn.relu)
    # decoder
    d1 = tf.layers.dense(e2, 512, activation=tf.nn.relu)
    model = tf.layers.dense(d1, 784, activation=tf.nn.relu)
    # training ops
    loss = tf.losses.mean_squared_error(labels, model)
    train = tf.train.AdamOptimizer().minimize(loss, global_step=tf.train.get_global_step())
    if mode == tf.estimator.ModeKeys.TRAIN:
        return tf.estimator.EstimatorSpec(mode=mode,
                                          loss=loss,
                                          train_op=train)

    # Evaluation metrics. Note that tf.metrics.precision and tf.metrics.recall
    # cast labels and predictions to bool before comparing them.
    prec, prec_update_op = tf.metrics.precision(labels=labels, predictions=model, name='precision_op')
    recall, recall_update_op = tf.metrics.recall(labels=labels, predictions=model, name='recall_op')

    metrics = {'recall': (recall, recall_update_op),
               'precision': (prec, prec_update_op)}

    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)

Now, to run evaluation periodically and print the loss every 10 steps:

configuration = tf.estimator.RunConfig(
    model_dir='logs',
    keep_checkpoint_max=5,
    save_checkpoints_steps=1500,
    log_step_count_steps=10)  # log the loss and global step every 10 steps

estimator = tf.estimator.Estimator(model_fn=_model_fn, params={}, config=configuration)

# TrainSpec takes max_steps (not steps). Training also ends early if the input
# function runs out of data, so _train_input_fn needs num_epochs=None to reach 5000.
train_spec = tf.estimator.TrainSpec(input_fn=_train_input_fn, max_steps=5000)
eval_spec = tf.estimator.EvalSpec(input_fn=_train_input_fn, steps=100, throttle_secs=600)

# Pass the estimator created above.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Note:

After every new checkpoint is saved (i.e. every 1500 steps), evaluation runs for 100 steps and then training resumes.
log_step_count_steps sets how often (every X steps) the loss is printed.
throttle_secs defines the minimum number of seconds between two consecutive evaluations. If a new checkpoint is stored before that many seconds have passed, evaluation is skipped for that checkpoint.
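
The interaction between checkpoint frequency and throttle_secs described in the notes above can be illustrated with a small simulation. This is a simplified model, not TensorFlow's scheduling code: the function throttled_evaluations is hypothetical, it assumes the first checkpoint always triggers an evaluation, and the rate of 40 steps per second is taken roughly from the training log in the question.

```python
def throttled_evaluations(checkpoint_times, throttle_secs):
    """Return the checkpoint times (in seconds) at which evaluation would run.

    Simplified, hypothetical model: a checkpoint is evaluated only if at
    least throttle_secs have passed since the previous evaluation.
    """
    evaluated = []
    last_eval = None
    for t in checkpoint_times:
        if last_eval is None or t - last_eval >= throttle_secs:
            evaluated.append(t)
            last_eval = t
    return evaluated


# Assumed: checkpoints every 1500 steps at roughly 40 steps per second.
steps_per_sec = 40.0
checkpoint_steps = [1500, 3000, 4500]
checkpoint_times = [s / steps_per_sec for s in checkpoint_steps]

slow = throttled_evaluations(checkpoint_times, throttle_secs=600)
fast = throttled_evaluations(checkpoint_times, throttle_secs=30)
print(slow)  # [37.5]
print(fast)  # [37.5, 75.0, 112.5]
```

In this model, with throttle_secs=600 and checkpoints arriving about every 37.5 seconds, only the first checkpoint is evaluated; lowering throttle_secs or raising save_checkpoints_steps lets more checkpoints through.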

The above trains and evaluates on the same dataset. To evaluate on a different dataset, pass that dataset's input function as the input_fn argument of tf.estimator.EvalSpec.
