Tensorflow batch loss spikes when restoring model for training from saved checkpoint?


Problem description

I'm encountering a strange issue that I've been trying to debug, without much luck. My model starts training properly with batch loss decreasing consistently (from ~6000 initially to ~120 after 20 epochs). However, when I pause training and resume training later by restoring the model from the checkpoint, the batch loss seems to spike unexpectedly from the previous batch loss (before pausing), and resumes decreasing from that higher loss point. My worry is that when I restore the model for evaluation, I may not be using the trained model that I think I am.

I have combed over my code several times, comparing to the Tensorflow tutorials. I tried to ensure that I was saving and restoring using the tutorial-suggested methods. Here is the code snapshot: https://github.com/KaranKash/DigitSpeak/tree/b7dad3128c88061ee374ae127579ec25cc7f5286 - the train.py file contains the saving and restoring steps, the graph setup and training process; while model.py creates the network layers and computes loss.
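
For context, here is a minimal sketch of the kind of tutorial-style save/resume flow described above (a sketch only -- the paths, the toy graph, and the constants are illustrative and not taken from the linked repo):

import os
import tensorflow as tf  # TensorFlow 1.x, matching the code in question

# Illustrative checkpoint location -- not the repo's actual path.
CKPT_DIR = "/tmp/digitspeak_demo"
CKPT_PATH = os.path.join(CKPT_DIR, "model.ckpt")
if not os.path.exists(CKPT_DIR):
    os.makedirs(CKPT_DIR)

# Stand-in for model.py: one weight, a dummy loss, and a momentum optimizer.
x = tf.placeholder(tf.float32, shape=[None, 1], name="x")
w = tf.get_variable("w", shape=[1, 1], initializer=tf.truncated_normal_initializer())
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - 1.0))
train_op = tf.train.MomentumOptimizer(0.01, 0.9).minimize(loss)

saver = tf.train.Saver()

with tf.Session() as sess:
    ckpt = tf.train.get_checkpoint_state(CKPT_DIR)
    if ckpt and ckpt.model_checkpoint_path:
        # Resume: restores the weights and the optimizer's Momentum slots.
        saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        # Fresh run only: initialize variables from scratch.
        sess.run(tf.global_variables_initializer())

    for step in range(100):
        _, batch_loss = sess.run([train_op, loss], feed_dict={x: [[1.0]] * 8})
        if step % 10 == 0:
            saver.save(sess, CKPT_PATH)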

Here is an example from my print statements - notice batch loss rises sharply when resuming training from epoch 7's checkpoint:

Epoch 6. Batch 31/38. Loss 171.28
Epoch 6. Batch 32/38. Loss 167.02
Epoch 6. Batch 33/38. Loss 173.29
Epoch 6. Batch 34/38. Loss 159.76
Epoch 6. Batch 35/38. Loss 164.17
Epoch 6. Batch 36/38. Loss 161.57
Epoch 6. Batch 37/38. Loss 165.40
Saving to /Users/user/DigitSpeak/cnn/model/model.ckpt
Epoch 7. Batch 0/38. Loss 169.99
Epoch 7. Batch 1/38. Loss 178.42
KeyboardInterrupt
dhcp-18-189-118-233:cnn user$ python train.py
Starting loss calculation...
Found in-progress model. Will resume from there.
Epoch 7. Batch 0/38. Loss 325.97
Epoch 7. Batch 1/38. Loss 312.10
Epoch 7. Batch 2/38. Loss 295.61
Epoch 7. Batch 3/38. Loss 306.96
Epoch 7. Batch 4/38. Loss 290.58
Epoch 7. Batch 5/38. Loss 275.72
Epoch 7. Batch 6/38. Loss 251.12

I've printed the results of the inspect_checkpoint.py script. I've also experimented with other optimizers (Adam and GradientDescentOptimizer) and noticed the same behavior with respect to the spiked loss after resuming training.

dhcp-18-189-118-233:cnn user$ python inspect_checkpoint.py
Optimizer/Variable (DT_INT32) []
conv1-layer/bias (DT_FLOAT) [64]
conv1-layer/bias/Momentum (DT_FLOAT) [64]
conv1-layer/weights (DT_FLOAT) [5,23,1,64]
conv1-layer/weights/Momentum (DT_FLOAT) [5,23,1,64]
conv2-layer/bias (DT_FLOAT) [512]
conv2-layer/bias/Momentum (DT_FLOAT) [512]
conv2-layer/weights (DT_FLOAT) [5,1,64,512]
conv2-layer/weights/Momentum (DT_FLOAT) [5,1,64,512]
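
For reference, a listing like the one above can be produced with a checkpoint reader (a sketch, not the actual inspect_checkpoint.py; ckpt_path is assumed to point at the saved checkpoint prefix):

import tensorflow as tf  # TensorFlow 1.x

ckpt_path = "/Users/user/DigitSpeak/cnn/model/model.ckpt"
reader = tf.train.NewCheckpointReader(ckpt_path)
# Print every variable stored in the checkpoint together with its shape.
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    print(name, shape)

The conv*-layer/.../Momentum entries show that the optimizer's slot variables are saved alongside the weights.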

Recommended answer

I ran into this issue and found that it was caused by initializing the graph variables when restoring the graph -- which throws away all learned parameters and replaces them with whatever initialization values were originally specified for each tensor in the original graph definition.

For example, if you use tf.global_variables_initializer() to initialize variables as part of your model program, then whatever control logic you use to decide that a saved graph should be restored, make sure the restore path omits: sess.run(tf.global_variables_initializer())
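
Here is a tiny self-contained demo of the mistake (the variable, the value 42.0, and the path are purely illustrative):

import tensorflow as tf  # TensorFlow 1.x

# "w" stands in for the model's learned parameters; 42.0 plays the role of a
# trained weight, and /tmp/restore_demo.ckpt is just an illustrative path.
w = tf.get_variable("w", shape=[], initializer=tf.zeros_initializer())
assign_w = tf.assign(w, 42.0)
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(assign_w)
    path = saver.save(sess, "/tmp/restore_demo.ckpt")

with tf.Session() as sess:
    saver.restore(sess, path)
    print(sess.run(w))                           # 42.0 -- the "learned" value is back
    sess.run(tf.global_variables_initializer())  # the mistake: re-initializes everything
    print(sess.run(w))                           # 0.0  -- the learned value is gone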

This was a simple but costly mistake for me, so I hope someone else is saved a few grey hairs (or hairs, in general).
