How to control when to compute evaluation vs training using the Estimator API of tensorflow?


Problem description

As described in this question:

The tensorflow documentation does not provide any example of how to perform a periodic evaluation of the model on an evaluation set

The accepted answer suggested the use of Experiment (which is deprecated according to this README).

Everything I found online points towards using the train_and_evaluate method. However, I still do not see how to switch between the two processes (train and evaluate). I have tried the following:

import tensorflow as tf

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    params=hparams,
    model_dir=model_dir,
    config=tf.estimator.RunConfig(
        save_checkpoints_steps=2000,
        save_summary_steps=100,
        keep_checkpoint_max=5
    )
)

train_input_fn = lambda: input_fn(
    train_file, #a .tfrecords file
    train=True,
    batch_size=70,
    num_epochs=100
)

eval_input_fn = lambda: input_fn(
    val_file, # another .tfrecords file
    train=False,
    batch_size=70,
    num_epochs=1
)
train_spec = tf.estimator.TrainSpec(
    train_input_fn,
    max_steps=125
)    

eval_spec = tf.estimator.EvalSpec(
    eval_input_fn,
    steps=30,
    name='validation',
    start_delay_secs=150,
    throttle_secs=200
)

tf.logging.info("start experiment...")
tf.estimator.train_and_evaluate(
    estimator,
    train_spec,
    eval_spec
)

Here is what I think my code should be doing:

Train the model for 100 epochs using a batch size of 70; save checkpoints every 2000 batches; save summaries every 100 batches; keep at most 5 checkpoints; after 150 batches on the training set, compute the validation error using 30 batches of validation data

However, I get the following logs:

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /output/model.ckpt.
INFO:tensorflow:loss = 39.55082, step = 1
INFO:tensorflow:global_step/sec: 178.622
INFO:tensorflow:loss = 1.0455043, step = 101 (0.560 sec)
INFO:tensorflow:Saving checkpoints for 150 into /output/model.ckpt.
INFO:tensorflow:Loss for final step: 0.8327793.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /projects/MNIST-GCP/output/model.ckpt-150
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [3/30]
INFO:tensorflow:Evaluation [6/30]
INFO:tensorflow:Evaluation [9/30]
INFO:tensorflow:Evaluation [12/30]
INFO:tensorflow:Evaluation [15/30]
INFO:tensorflow:Evaluation [18/30]
INFO:tensorflow:Evaluation [21/30]
INFO:tensorflow:Evaluation [24/30]
INFO:tensorflow:Evaluation [27/30]
INFO:tensorflow:Evaluation [30/30]
INFO:tensorflow:Finished evaluation at 2018-04-02-22:49:15
INFO:tensorflow:Saving dict for global step 150: accuracy = 0.8552381, global_step =150, loss = 0.95031387

From the logs, it seems that the training stops after the first evaluation step. What am I missing from the documentation? Could you explain how I should have implemented what I think my code is doing?

Additional info: I am running everything using the MNIST dataset, which has 50,000 images in the training set, so (I think) the model should run for num_epochs × 50,000 / batch_size ≈ 71,000 steps.

I sincerely appreciate your help!

Edit: after running experiments I realized that max_steps controls the number of steps of the whole training procedure, not just the number of steps before computing the metrics on the test set. Reading tf.estimator.Estimator.train, I see it has a steps argument, which works incrementally and is bounded by max_steps; however, tf.estimator.TrainSpec does not have a steps argument, which means I cannot control the number of steps to take before computing metrics on the validation set.
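One way to get that control is to skip train_and_evaluate and alternate explicit train()/evaluate() calls yourself, since Estimator.train advances the global step incrementally across calls. The sketch below uses a minimal stand-in class in place of tf.estimator.Estimator so the control flow is runnable here without TensorFlow; with a real Estimator the two calls take the same shape (an input_fn plus a steps count), and the loop body would be unchanged.

```python
class StubEstimator:
    """Minimal stand-in mimicking tf.estimator.Estimator's train/evaluate calls."""

    def __init__(self):
        self.global_step = 0
        self.evaluations = []  # global steps at which evaluate() was called

    def train(self, input_fn=None, steps=None):
        # `steps` is incremental: each call advances the global step by `steps`.
        self.global_step += steps

    def evaluate(self, input_fn=None, steps=None):
        self.evaluations.append(self.global_step)
        return {"global_step": self.global_step}


estimator = StubEstimator()

steps_between_evals = 150  # train this many steps between evaluations
total_steps = 600          # overall training budget

# Explicit alternation: train a fixed chunk, then evaluate, and repeat.
while estimator.global_step < total_steps:
    estimator.train(input_fn=None, steps=steps_between_evals)
    metrics = estimator.evaluate(input_fn=None, steps=30)

print(estimator.evaluations)  # [150, 300, 450, 600]
```

With a real Estimator, each evaluate() call restores from the latest checkpoint, so you would also want save_checkpoints_steps to be no larger than steps_between_evals.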

Recommended answer

From my understanding, evaluation happens using a model restored from the latest checkpoint. In your case, you don't save a checkpoint until step 2000. You also set max_steps=125, which takes precedence over the size of the dataset you feed your model.

Therefore, even though you specify a batch size of 70 and 100 epochs, your model stops training at 125 steps, well below the 2000-step checkpoint interval, which in turn limits evaluation, because evaluation depends on the checkpointed model.

Note that, by default, evaluation happens with every checkpoint save, assuming you don't set a throttle_secs limit.
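Concretely, with the numbers from the question, no periodic checkpoint (and hence no periodic evaluation) can occur before training stops. A small check, reusing the values from the question's RunConfig and TrainSpec:

```python
save_checkpoints_steps = 2000  # from RunConfig
max_steps = 125                # from TrainSpec

# train_and_evaluate evaluates only when a new checkpoint appears, so the
# number of periodic evaluations is bounded by how many full checkpoint
# intervals fit into the training budget.
periodic_checkpoints = max_steps // save_checkpoints_steps
print(periodic_checkpoints)  # 0
```

Zero periodic checkpoints fit before max_steps is reached; only the final checkpoint, written when training stops, triggers the single evaluation seen in the logs.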
