Why did my style-transfer model suddenly stop learning after 3700/20000 batches?


Problem description

Continued from before: Why does my model work with `tf.GradientTape()` but fail when using `keras.models.Model.fit()`?

I'm working on replicating the perceptual style-transfer model diagrammed in the research paper (architecture figure not reproduced here).

I finally have my model learning as expected on 1000 images from the COCO2014 dataset. But then I tried to run 2 epochs of the entire dataset, with 20695 batches per epoch (as per the research paper). It starts learning very quickly, but after about 3700 steps it just mysteriously fails. (I save one generated image every 100 batches, most recent on the left.)

The predictions I make with the saved checkpoints show similar results.

Looking at the losses near the point of failure, I see:

# output_1 is content_loss
# output_2-6 are gram matrix style_loss values
 [batch:3400/20695] - loss: 953168.7218 - output_1_loss: 123929.1953 - output_2_loss: 55090.2109 - output_3_loss: 168500.2344 - output_4_loss: 139039.1250 - output_5_loss: 355890.0312 - output_6_loss: 110718.5781

 [batch:3500/20695] - loss: 935344.0219 - output_1_loss: 124042.5938 - output_2_loss: 53807.3516 - output_3_loss: 164373.4844 - output_4_loss: 135753.5938 - output_5_loss: 348085.6250 - output_6_loss: 109280.0469

 [batch:3600/20695] - loss: 918017.2146 - output_1_loss: 124055.9922 - output_2_loss: 52535.9062 - output_3_loss: 160401.0469 - output_4_loss: 132601.0156 - output_5_loss: 340561.5938 - output_6_loss: 107860.3047

 [batch:3700/20695] - loss: 901454.0553 - output_1_loss: 124096.1328 - output_2_loss: 51326.8672 - output_3_loss: 156607.0312 - output_4_loss: 129584.2578 - output_5_loss: 333345.5312 - output_6_loss: 106493.0781

 [batch:3750/20695] - loss: 893397.4667 - output_1_loss: 124108.4531 - output_2_loss: 50735.1992 - output_3_loss: 154768.8281 - output_4_loss: 128128.1953 - output_5_loss: 329850.2188 - output_6_loss: 105805.6250

# total loss increases after batch=3750. WHY???

 [batch:3800/20695] - loss: 1044768.7239 - output_1_loss: 123897.2188 - output_2_loss: 101063.2812 - output_3_loss: 200778.2812 - output_4_loss: 141584.6875 - output_5_loss: 370377.5000 - output_6_loss: 107066.7812

 [batch:3900/20695] - loss: 1479362.4735 - output_1_loss: 123050.9766 - output_2_loss: 200276.5156 - output_3_loss: 356414.2188 - output_4_loss: 185420.0781 - output_5_loss: 502506.7500 - output_6_loss: 111692.8750 
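For reference, `output_2` through `output_6` are mean-squared errors between Gram matrices of feature maps, computed roughly along these lines (a minimal sketch of the standard formulation; the notebook's actual code may differ):

```python
import tensorflow as tf

def gram_matrix(feature_maps):
    """Gram matrix of a batch of CNN feature maps.

    feature_maps: float tensor of shape (batch, height, width, channels).
    Returns (batch, channels, channels), normalized by h * w * c as in
    the perceptual-loss paper.
    """
    b, h, w, c = tf.unstack(tf.shape(feature_maps))
    features = tf.reshape(feature_maps, (b, h * w, c))
    gram = tf.matmul(features, features, transpose_a=True)  # (b, c, c)
    return gram / tf.cast(h * w * c, tf.float32)

def style_loss(generated_features, style_features):
    """Mean squared error between the two Gram matrices."""
    return tf.reduce_mean(
        tf.square(gram_matrix(generated_features) - gram_matrix(style_features))
    )
```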

I can't begin to think of how to debug this problem. Once it "works", should the model continue to work? It seems like some kind of buffer overflow, but I have no idea how to find it. Any ideas?

The full colab notebook/repo can be found here: https://colab.research.google.com/github/mixuala/fast_neural_style_pytorch/blob/master/notebook/%5BSO%5D_Coco14_FastStyleTransfer.ipynb

Solution

I found a saturated all-white image (RGB=255) that caused the model to become unstable. It appeared in batch=3696, with batch_size=4. When I skipped that batch, everything worked fine.
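To guard against this kind of image, one option is to filter out flat or saturated frames before batching. This is a sketch under the assumption of a `tf.data` input pipeline; the predicate and threshold are illustrative, not the actual fix in the notebook:

```python
import tensorflow as tf

def is_not_flat(image):
    """Reject images that are (nearly) one flat color, such as the
    all-white RGB=255 frame that destabilized training.

    `image` is a single decoded image; the 1.0 threshold is illustrative.
    """
    pixels = tf.cast(image, tf.float32)
    # A saturated/flat image has near-zero pixel standard deviation.
    return tf.math.reduce_std(pixels) > 1.0

# Hypothetical usage, assuming `dataset` yields unbatched COCO images:
# dataset = dataset.filter(is_not_flat).batch(4)
```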

I know that some monitoring code got a divide-by-zero error when trying to normalize the domain of the image, but I'm not sure whether that error is connected to the model destabilization. The generated image from the model was all black.
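If the monitoring code min-max normalizes the image, a flat RGB=255 frame makes `max - min` equal zero, which would explain the divide by zero. A guarded version might look like this (a sketch; I'm assuming min-max normalization, which may not be what the monitoring code actually does):

```python
import tensorflow as tf

def safe_minmax_normalize(image, eps=1e-8):
    """Scale an image to [0, 1] without dividing by zero.

    For a flat image (e.g. all-white RGB=255), max - min == 0 and the
    unguarded (image - min) / (max - min) yields inf/NaN. The epsilon
    floor keeps the division finite. (Assumed normalization scheme.)
    """
    pixels = tf.cast(image, tf.float32)
    lo = tf.reduce_min(pixels)
    hi = tf.reduce_max(pixels)
    return (pixels - lo) / tf.maximum(hi - lo, eps)
```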
