How to control when to compute evaluation vs training using the Estimator API of tensorflow?


Problem description

As described in this question:

The tensorflow documentation does not provide any example of how to perform a periodic evaluation of the model on an evaluation set

The accepted answer suggested the use of Experiment (which is deprecated according to this README).

Everything I found online points towards using the train_and_evaluate method. However, I still do not see how to switch between the two processes (train and evaluate). I have tried the following:

import tensorflow as tf

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    params=hparams,
    model_dir=model_dir,
    config=tf.estimator.RunConfig(
        save_checkpoints_steps=2000,
        save_summary_steps=100,
        keep_checkpoint_max=5
    )
)

train_input_fn = lambda: input_fn(
    train_file, #a .tfrecords file
    train=True,
    batch_size=70,
    num_epochs=100
)

eval_input_fn = lambda: input_fn(
    val_file, # another .tfrecords file
    train=False,
    batch_size=70,
    num_epochs=1
)
train_spec = tf.estimator.TrainSpec(
    train_input_fn,
    max_steps=125
)    

eval_spec = tf.estimator.EvalSpec(
    eval_input_fn,
    steps=30,
    name='validation',
    start_delay_secs=150,
    throttle_secs=200
)

tf.logging.info("start experiment...")
tf.estimator.train_and_evaluate(
    estimator,
    train_spec,
    eval_spec
)

Here is what I think my code should be doing:

Train the model for 100 epochs using a batch size of 70; save checkpoints every 2000 batches; save summaries every 100 batches; keep at most 5 checkpoints; after 150 batches on the training set, compute the validation error using 30 batches of validation data

However, I get the following logs:

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /output/model.ckpt.
INFO:tensorflow:loss = 39.55082, step = 1
INFO:tensorflow:global_step/sec: 178.622
INFO:tensorflow:loss = 1.0455043, step = 101 (0.560 sec)
INFO:tensorflow:Saving checkpoints for 150 into /output/model.ckpt.
INFO:tensorflow:Loss for final step: 0.8327793.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /projects/MNIST-GCP/output/model.ckpt-150
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [3/30]
INFO:tensorflow:Evaluation [6/30]
INFO:tensorflow:Evaluation [9/30]
INFO:tensorflow:Evaluation [12/30]
INFO:tensorflow:Evaluation [15/30]
INFO:tensorflow:Evaluation [18/30]
INFO:tensorflow:Evaluation [21/30]
INFO:tensorflow:Evaluation [24/30]
INFO:tensorflow:Evaluation [27/30]
INFO:tensorflow:Evaluation [30/30]
INFO:tensorflow:Finished evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Saving dict for global step 150: accuracy = 0.8552381, global_step =150, loss = 0.95031387

From the logs, it seems that the training stops after the first evaluation step. What am I missing from the documentation? Could you explain how I should have implemented what I think my code is doing?

Additional info: I am running everything using the MNIST dataset, which has 50,000 images in the training set, so (I think) the model should run for num_epochs × 50,000 / batch_size ≈ 71,000 steps.

I sincerely appreciate your help!

Edit: after running experiments I realized that max_steps controls the number of steps of the whole training procedure, not just the number of steps before computing the metrics on the test set. Reading tf.estimator.Estimator.train, I see it has a steps argument, which works incrementally and is bounded by max_steps; however, tf.estimator.TrainSpec does not have a steps argument, which means I cannot control the number of steps to take before computing metrics on the validation set.
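One way to get that control is to skip train_and_evaluate and alternate explicit train()/evaluate() calls yourself, since Estimator.train advances the global step incrementally across calls. The sketch below uses a minimal stand-in class in place of tf.estimator.Estimator so the control flow is runnable here without TensorFlow; with a real Estimator the two calls take the same shape (an input_fn plus a steps count), and the loop body would be unchanged.

```python
class StubEstimator:
    """Minimal stand-in mimicking tf.estimator.Estimator's train/evaluate calls."""

    def __init__(self):
        self.global_step = 0
        self.evaluations = []  # global steps at which evaluate() was called

    def train(self, input_fn=None, steps=None):
        # `steps` is incremental: each call advances the global step by `steps`.
        self.global_step += steps

    def evaluate(self, input_fn=None, steps=None):
        self.evaluations.append(self.global_step)
        return {"global_step": self.global_step}


estimator = StubEstimator()

steps_between_evals = 150  # train this many steps between evaluations
total_steps = 600          # overall training budget

# Explicit alternation: train a fixed chunk, then evaluate, and repeat.
while estimator.global_step < total_steps:
    estimator.train(input_fn=None, steps=steps_between_evals)
    metrics = estimator.evaluate(input_fn=None, steps=30)

print(estimator.evaluations)  # [150, 300, 450, 600]
```

With a real Estimator, each evaluate() call restores from the latest checkpoint, so you would also want save_checkpoints_steps to be no larger than steps_between_evals.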

Recommended answer

From my understanding, evaluation happens using a model restored from the latest checkpoint. In your case, you don't save a checkpoint until step 2000. You also set max_steps=125, which takes precedence over the size of the dataset you feed your model.

Therefore, even though you specify a batch size of 70 and 100 epochs, your model stops training at 125 steps, well below the 2000-step checkpoint interval, which in turn limits evaluation, because evaluation depends on the checkpointed model.

Note that, by default, evaluation happens with every checkpoint save, assuming you don't set a throttle_secs limit.
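Concretely, with the numbers from the question, no periodic checkpoint (and hence no periodic evaluation) can occur before training stops. A small check, reusing the values from the question's RunConfig and TrainSpec:

```python
save_checkpoints_steps = 2000  # from RunConfig
max_steps = 125                # from TrainSpec

# train_and_evaluate evaluates only when a new checkpoint appears, so the
# number of periodic evaluations is bounded by how many full checkpoint
# intervals fit into the training budget.
periodic_checkpoints = max_steps // save_checkpoints_steps
print(periodic_checkpoints)  # 0
```

Zero periodic checkpoints fit before max_steps is reached; only the final checkpoint, written when training stops, triggers the single evaluation seen in the logs.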
