TensorFlow 1.10+ custom estimator early stopping with train_and_evaluate

Question

Suppose you are training a custom tf.estimator.Estimator with tf.estimator.train_and_evaluate using a validation dataset, in a setup similar to that of @simlmx:

import tensorflow as tf

# model_fn, model_dir, params and the two input_fns are defined elsewhere
classifier = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir=model_dir,
    params=params)

train_spec = tf.estimator.TrainSpec(
    input_fn=training_data_input_fn,
)

eval_spec = tf.estimator.EvalSpec(
    input_fn=validation_data_input_fn,
)

tf.estimator.train_and_evaluate(
    classifier,
    train_spec,
    eval_spec
)

Often, one uses a validation dataset to cut off training to prevent over-fitting when the loss continues to improve for the training dataset but not for the validation dataset.

Currently the tf.estimator.EvalSpec allows one to specify after how many steps (defaults to 100) to evaluate the model.

How can one (if possible without using tf.contrib functions) specify that training should terminate after n evaluation calls (n * steps) in which the evaluation loss does not improve, and then save the "best" model / checkpoint (as determined on the validation dataset) to a unique file name (e.g. best_validation.checkpoint)?

Answer

I understand your confusion now. The documentation for stop_if_no_decrease_hook states (emphasis mine):

max_steps_without_decrease: int, maximum number of training steps with no decrease in the given metric.

eval_dir: If set, directory containing summary files with eval metrics. By default, estimator.eval_dir() will be used.

Looking through the code of the hook (version 1.11), though, you find:

def stop_if_no_metric_improvement_fn():
    """Returns `True` if metric does not improve within max steps."""

    eval_results = read_eval_metrics(eval_dir) #<<<<<<<<<<<<<<<<<<<<<<<

    best_val = None
    best_val_step = None
    for step, metrics in eval_results.items(): #<<<<<<<<<<<<<<<<<<<<<<<
      if step < min_steps:
        continue
      val = metrics[metric_name]
      if best_val is None or is_lhs_better(val, best_val):
        best_val = val
        best_val_step = step
      if step - best_val_step >= max_steps_without_improvement: #<<<<<
        tf_logging.info(
            'No %s in metric "%s" for %s steps, which is greater than or equal '
            'to max steps (%s) configured for early stopping.',
            increase_or_decrease, metric_name, step - best_val_step,
            max_steps_without_improvement)
        return True
    return False

What the code does is load the evaluation results (produced according to your EvalSpec parameters) and extract, for each evaluation record, the metric values and the global_step (or whichever other custom step you use for counting) associated with it.
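
For instance, here is a minimal sketch of inspecting those records yourself (assuming a TF version that exposes read_eval_metrics under tf.contrib.estimator, and that your eval metrics contain a 'loss' key):

import tensorflow as tf

# classifier is the Estimator from the question; eval_dir() is the directory
# where its evaluation summaries are written
eval_results = tf.contrib.estimator.read_eval_metrics(classifier.eval_dir())
for step, metrics in eval_results.items():
    print(step, metrics['loss'])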

This is the source of the training steps part of the docs: the early stopping is not triggered according to the number of non-improving evaluations, but according to the number of non-improving evals within a certain range of training steps (which IMHO is a bit counter-intuitive).

So, to recap: Yes, the early-stopping hook uses the evaluation results to decide when it's time to cut the training, but you need to pass in the number of training steps you want to monitor and keep in mind how many evaluations will happen in that number of steps.

Let's assume you're training indefinitely with an evaluation every 1k steps. The specifics of how the evaluation runs are not relevant, as long as it runs every 1k steps and produces a metric we want to monitor.

If you set the hook as hook = tf.contrib.estimator.stop_if_no_decrease_hook(my_estimator, 'my_metric_to_monitor', 10000) the hook will consider the evaluations happening in a range of 10k steps.

Since you're running 1 eval every 1k steps, this boils down to early stopping if there's a sequence of 10 consecutive evals without any improvement. If you then decide to rerun with evals every 2k steps, the hook will only consider a sequence of 5 consecutive evals without improvement.
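
Putting this together with the setup from the question, a rough sketch could look like the following (the metric name 'loss' and the step counts are illustrative assumptions, not values taken from the question):

# classifier, training_data_input_fn and eval_spec are the ones defined in the question
early_stopping = tf.contrib.estimator.stop_if_no_decrease_hook(
    classifier,
    metric_name='loss',                # metric to monitor in the eval summaries
    max_steps_without_decrease=10000,  # ~10 evals if you evaluate every 1k steps
    min_steps=1000)                    # evaluations before this step count are ignored

# it is a training hook, so it goes into the TrainSpec
train_spec = tf.estimator.TrainSpec(
    input_fn=training_data_input_fn,
    hooks=[early_stopping])

tf.estimator.train_and_evaluate(classifier, train_spec, eval_spec)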

First of all, an important note: this has nothing to do with early stopping; the issue of keeping a copy of the best model throughout training and the issue of stopping the training once performance starts degrading are completely unrelated.

Keeping the best model can be done very easily by defining a tf.estimator.BestExporter in your EvalSpec (snippet taken from the link):

  serving_input_receiver_fn = ... # define your serving_input_receiver_fn
  exporter = tf.estimator.BestExporter(
      name="best_exporter",
      serving_input_receiver_fn=serving_input_receiver_fn,
      exports_to_keep=5) # this will keep the 5 best checkpoints

  # note: train_and_evaluate expects a single EvalSpec, not a list
  eval_spec = tf.estimator.EvalSpec(
      input_fn=eval_input_fn,
      steps=100,
      exporters=exporter,
      start_delay_secs=0,
      throttle_secs=5)

If you don't know how to define the serving_input_fn, have a look here.
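
As a rough illustration (the feature name 'x' and its shape are made-up placeholders; adapt them to whatever features your model_fn expects), a raw-tensor serving input function might look like:

def serving_input_receiver_fn():
    # placeholder that will be fed at serving time; name and shape are hypothetical
    inputs = {'x': tf.placeholder(dtype=tf.float32, shape=[None, 10], name='x')}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)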

This allows you to keep the overall best 5 models you obtained, stored as SavedModels (which is the preferred way to store models at the moment).
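
Later on you can load one of those exports for inference, e.g. with tf.contrib.predictor (a sketch assuming your TF 1.x build ships tf.contrib.predictor; the path below is a hypothetical example of the timestamped directories BestExporter writes under model_dir/export/<exporter_name>/):

import tensorflow as tf

# hypothetical path: <model_dir>/export/<exporter_name>/<timestamp>
export_dir = 'model_dir/export/best_exporter/1543166794'
predict_fn = tf.contrib.predictor.from_saved_model(export_dir)
# the dict keys must match the receiver tensors of your serving_input_receiver_fn
print(predict_fn({'x': [[0.0] * 10]}))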
