Tensorflow batch loss spikes when restoring model for training from saved checkpoint?


Problem description

I'm encountering a strange issue that I've been trying to debug, without much luck. My model starts training properly with batch loss decreasing consistently (from ~6000 initially to ~120 after 20 epochs). However, when I pause training and resume training later by restoring the model from the checkpoint, the batch loss seems to spike unexpectedly from the previous batch loss (before pausing), and resumes decreasing from that higher loss point. My worry is that when I restore the model for evaluation, I may not be using the trained model that I think I am.

I have combed over my code several times, comparing to the Tensorflow tutorials. I tried to ensure that I was saving and restoring using the tutorial-suggested methods. Here is the code snapshot: https://github.com/KaranKash/DigitSpeak/tree/b7dad3128c88061ee374ae127579ec25cc7f5286 - the train.py file contains the saving and restoring steps, the graph setup and training process; while model.py creates the network layers and computes loss.
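
For context, here is a minimal sketch of the kind of tutorial-style save/resume flow described above (a sketch only -- the paths, the toy graph, and the constants are illustrative and not taken from the linked repo):

import os
import tensorflow as tf  # TensorFlow 1.x, matching the code in question

# Illustrative checkpoint location -- not the repo's actual path.
CKPT_DIR = "/tmp/digitspeak_demo"
CKPT_PATH = os.path.join(CKPT_DIR, "model.ckpt")
if not os.path.exists(CKPT_DIR):
    os.makedirs(CKPT_DIR)

# Stand-in for model.py: one weight, a dummy loss, and a momentum optimizer.
x = tf.placeholder(tf.float32, shape=[None, 1], name="x")
w = tf.get_variable("w", shape=[1, 1], initializer=tf.truncated_normal_initializer())
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - 1.0))
train_op = tf.train.MomentumOptimizer(0.01, 0.9).minimize(loss)

saver = tf.train.Saver()

with tf.Session() as sess:
    ckpt = tf.train.get_checkpoint_state(CKPT_DIR)
    if ckpt and ckpt.model_checkpoint_path:
        # Resume: restores the weights and the optimizer's Momentum slots.
        saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        # Fresh run only: initialize variables from scratch.
        sess.run(tf.global_variables_initializer())

    for step in range(100):
        _, batch_loss = sess.run([train_op, loss], feed_dict={x: [[1.0]] * 8})
        if step % 10 == 0:
            saver.save(sess, CKPT_PATH)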

Here is an example from my print statements - notice batch loss rises sharply when resuming training from epoch 7's checkpoint:

Epoch 6. Batch 31/38. Loss 171.28
Epoch 6. Batch 32/38. Loss 167.02
Epoch 6. Batch 33/38. Loss 173.29
Epoch 6. Batch 34/38. Loss 159.76
Epoch 6. Batch 35/38. Loss 164.17
Epoch 6. Batch 36/38. Loss 161.57
Epoch 6. Batch 37/38. Loss 165.40
Saving to /Users/user/DigitSpeak/cnn/model/model.ckpt
Epoch 7. Batch 0/38. Loss 169.99
Epoch 7. Batch 1/38. Loss 178.42
KeyboardInterrupt
dhcp-18-189-118-233:cnn user$ python train.py
Starting loss calculation...
Found in-progress model. Will resume from there.
Epoch 7. Batch 0/38. Loss 325.97
Epoch 7. Batch 1/38. Loss 312.10
Epoch 7. Batch 2/38. Loss 295.61
Epoch 7. Batch 3/38. Loss 306.96
Epoch 7. Batch 4/38. Loss 290.58
Epoch 7. Batch 5/38. Loss 275.72
Epoch 7. Batch 6/38. Loss 251.12

I've printed the results of the inspect_checkpoint.py script. I've also experimented with other optimizers (Adam and GradientDescentOptimizer) and noticed the same behavior with respect to the spiked loss after resuming training.

dhcp-18-189-118-233:cnn user$ python inspect_checkpoint.py
Optimizer/Variable (DT_INT32) []
conv1-layer/bias (DT_FLOAT) [64]
conv1-layer/bias/Momentum (DT_FLOAT) [64]
conv1-layer/weights (DT_FLOAT) [5,23,1,64]
conv1-layer/weights/Momentum (DT_FLOAT) [5,23,1,64]
conv2-layer/bias (DT_FLOAT) [512]
conv2-layer/bias/Momentum (DT_FLOAT) [512]
conv2-layer/weights (DT_FLOAT) [5,1,64,512]
conv2-layer/weights/Momentum (DT_FLOAT) [5,1,64,512]
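
For reference, a listing like the one above can be produced with a checkpoint reader (a sketch, not the actual inspect_checkpoint.py; ckpt_path is assumed to point at the saved checkpoint prefix):

import tensorflow as tf  # TensorFlow 1.x

ckpt_path = "/Users/user/DigitSpeak/cnn/model/model.ckpt"
reader = tf.train.NewCheckpointReader(ckpt_path)
# Print every variable stored in the checkpoint together with its shape.
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    print(name, shape)

The conv*-layer/.../Momentum entries show that the optimizer's slot variables are saved alongside the weights.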

Recommended answer

I ran into this issue and found that it was caused by initializing the graph variables when restoring the graph -- which throws away all learned parameters and replaces them with whatever initialization values were originally specified for each tensor in the original graph definition.

For example, if you use tf.global_variables_initializer() to initialize variables as part of your model program, then whatever control logic you use to decide that a saved graph should be restored, make sure the restore path omits: sess.run(tf.global_variables_initializer())
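
Here is a tiny self-contained demo of the mistake (the variable, the value 42.0, and the path are purely illustrative):

import tensorflow as tf  # TensorFlow 1.x

# "w" stands in for the model's learned parameters; 42.0 plays the role of a
# trained weight, and /tmp/restore_demo.ckpt is just an illustrative path.
w = tf.get_variable("w", shape=[], initializer=tf.zeros_initializer())
assign_w = tf.assign(w, 42.0)
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(assign_w)
    path = saver.save(sess, "/tmp/restore_demo.ckpt")

with tf.Session() as sess:
    saver.restore(sess, path)
    print(sess.run(w))                           # 42.0 -- the "learned" value is back
    sess.run(tf.global_variables_initializer())  # the mistake: re-initializes everything
    print(sess.run(w))                           # 0.0  -- the learned value is gone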

This was a simple but costly mistake for me, so I hope someone else is saved a few grey hairs (or hairs, in general).
