How to use evaluation_loop with train_loop in tf-slim


Question

I'm trying to implement a few different models and train them on CIFAR-10, and I want to use TF-slim to do this. It looks like TF-slim has two main loops that are useful during training: train_loop and evaluation_loop.

My question is: what is the canonical way to use these loops? As a follow-up: is it possible to use early stopping with train_loop?

Currently I have a model, and my training file train.py looks like this:

import ...
train_log_dir = ...

with tf.device("/cpu:0"):
  images, labels, dataset = set_up_input_pipeline_with_fancy_prefetching( 
                                                                subset='train', ... )
logits, end_points = set_up_model( images )  # Possibly using many GPUs
total_loss = set_up_loss( logits, labels, dataset )
optimizer, global_step = set_up_optimizer( dataset )
train_tensor = slim.learning.create_train_op( 
                                      total_loss, 
                                      optimizer,
                                      global_step=global_step,
                                      clip_gradient_norm=FLAGS.clip_gradient_norm,
                                      summarize_gradients=True)
slim.learning.train(train_tensor, 
                      logdir=train_log_dir,
                      local_init_op=tf.initialize_local_variables(),
                      save_summaries_secs=FLAGS.save_summaries_secs,
                      save_interval_secs=FLAGS.save_interval_secs)

This is awesome so far: my models all train and converge nicely. I can see this from the events in train_log_dir, where all the metrics are going in the right direction. And going in the right direction makes me happy.

But I'd like to check that the metrics are improving on the validation set, too. I don't know of any way to do this with TF-slim that plays nicely with the training loop, so I created a second file called eval.py, which contains my evaluation loop:

import ...
train_log_dir = ...

with tf.device("/cpu:0"):
  images, labels, dataset = set_up_input_pipeline_with_fancy_prefetching( 
                                                                subset='validation', ... )
logits, end_points = set_up_model( images )
summary_ops, names_to_values, names_to_updates = create_metrics_and_summary_ops( 
                                                                logits,
                                                                labels,
                                                                dataset.num_classes() )

slim.get_or_create_global_step()
slim.evaluation.evaluation_loop(
      '',
      checkpoint_dir=train_log_dir,
      logdir=train_log_dir,
      num_evals=FLAGS.num_eval_batches,
      eval_op=list(names_to_updates.values()),
      summary_op=tf.merge_summary(summary_ops),
      eval_interval_secs=FLAGS.eval_interval_secs,
      session_config=config)

Questions:

1) I currently have this model for the evaluation_loop hogging an entire GPU, but it's rarely being used. I assume there's a better way to allocate resources. It would be pretty nice if I could use the same evaluation_loop to monitor the progress of multiple different models (checkpoints in multiple directories). Is something like this possible?

2) There's no feedback between the evaluation and training. I'm training a ton of models and would love to use early stopping to halt the models which aren't learning or are not converging. Is there a way to do this? Ideally using information from the validation set, but if it has to be just based on the training data that's okay, too.

3) Is my workflow all wrong, and should I be structuring it differently? It's not clear from the documentation how to use evaluation in conjunction with training.

Update: ~~It seems that as of TF r0.11 I'm also getting a segfault when calling slim.evaluation.evaluation_loop. It only happens sometimes (for me, when I dispatch my jobs to a cluster). It happens in sv.managed_session, specifically prepare_or_wait_for_session.~~ This was just due to the evaluation loop (a second instance of TensorFlow) trying to use the GPU, which was already requisitioned by the first instance.

Recommended answer

  1. evaluation_loop is meant to be used (as you are currently using it) with a single directory. If you want to be more efficient, you could use slim.evaluation.evaluate_once and add your own logic for swapping directories.

  2. You can do this by passing a custom function as the train_step_fn argument of slim.learning.train(..., train_step_fn). This argument replaces the default 'train_step' function. Your custom training function can return the 'total_loss' and 'should_stop' values as you see fit.

  3. Your workflow looks great; this is probably the most common workflow for training and evaluation with TF-Slim.
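
As a rough illustration of the "swap directories yourself" idea around slim.evaluation.evaluate_once, here is a minimal sketch in plain Python. The function name evaluate_new_checkpoints and its parameters (latest_checkpoint_fn, evaluate_fn, last_seen) are hypothetical and not part of TF-Slim. The sketch assumes you pass in something like tf.train.latest_checkpoint as latest_checkpoint_fn, and a small wrapper of your own that calls slim.evaluation.evaluate_once as evaluate_fn.

```python
def evaluate_new_checkpoints(train_dirs, latest_checkpoint_fn, evaluate_fn, last_seen):
    """Evaluate every directory whose newest checkpoint has not been evaluated yet.

    train_dirs:           list of training log directories to monitor.
    latest_checkpoint_fn: callable(dir) -> checkpoint path or None
                          (assumption: e.g. tf.train.latest_checkpoint).
    evaluate_fn:          callable(checkpoint_path, train_dir); hypothetical
                          wrapper around slim.evaluation.evaluate_once.
    last_seen:            dict mapping directory -> last evaluated checkpoint;
                          updated in place so repeated calls skip old checkpoints.

    Returns the list of directories that were evaluated in this pass.
    """
    evaluated = []
    for train_dir in train_dirs:
        checkpoint_path = latest_checkpoint_fn(train_dir)
        # Skip directories with no checkpoint yet, or with nothing new.
        if checkpoint_path is None or last_seen.get(train_dir) == checkpoint_path:
            continue
        evaluate_fn(checkpoint_path, train_dir)
        last_seen[train_dir] = checkpoint_path
        evaluated.append(train_dir)
    return evaluated
```

One would typically call this in a loop with a sleep between passes. If the monitored models have different architectures, each directory presumably needs its own graph built before its evaluation call; that part is left to evaluate_fn.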

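For the early-stopping suggestion, a minimal sketch of a custom train step might look like the following. It assumes the train_step_fn signature (sess, train_op, global_step, train_step_kwargs) returning (total_loss, should_stop), as used by slim.learning.train_step. The wrapper name make_early_stopping_train_step and the patience and min_delta parameters are hypothetical. It stops on a plateau in the training loss only; using validation metrics would need some extra channel from the evaluation process, which is not shown here.

```python
def make_early_stopping_train_step(base_train_step, patience=1000, min_delta=0.0):
    """Wrap a train-step function so that training stops when the loss plateaus.

    base_train_step: callable(sess, train_op, global_step, train_step_kwargs)
                     returning (total_loss, should_stop); assumption: this
                     would be slim.learning.train_step in a real run.
    patience:        number of consecutive steps without improvement allowed.
    min_delta:       minimum decrease in loss that counts as an improvement.
    """
    state = {"best_loss": float("inf"), "steps_since_best": 0}

    def train_step(sess, train_op, global_step, train_step_kwargs):
        # Run the normal training step first.
        total_loss, should_stop = base_train_step(
            sess, train_op, global_step, train_step_kwargs)
        # Track the best loss seen so far and how long ago it occurred.
        if total_loss < state["best_loss"] - min_delta:
            state["best_loss"] = total_loss
            state["steps_since_best"] = 0
        else:
            state["steps_since_best"] += 1
        # Request a stop once the loss has not improved for `patience` steps.
        if state["steps_since_best"] >= patience:
            should_stop = True
        return total_loss, should_stop

    return train_step
```

In train.py this could then be passed as train_step_fn=make_early_stopping_train_step(slim.learning.train_step) in the slim.learning.train call. Because per-batch losses are noisy, a smoothed loss or a generous patience value is probably a safer choice in practice.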