Why did my style-transfer model suddenly stop learning after 3700/20000 batches?


Problem description

Continuing from an earlier question: I'm working on replicating the perceptual style transfer model as diagrammed below:

I finally have my model learning as expected on 1000 images from the COCO2014 dataset. But then I tried to run 2 epochs of the entire dataset, with 20695 batches per epoch (as per the research paper). It starts learning very quickly, but after about 3700 steps it just mysteriously fails. (I save 1 generated image every 100 batches, most recent on the left.)

The predictions I make with the saved checkpoints show similar results:

Looking at the losses near the point of failure, I see:

# output_1 is content_loss
# output_2-6 are gram matrix style_loss values
 [batch:3400/20695] - loss: 953168.7218 - output_1_loss: 123929.1953 - output_2_loss: 55090.2109 - output_3_loss: 168500.2344 - output_4_loss: 139039.1250 - output_5_loss: 355890.0312 - output_6_loss: 110718.5781

 [batch:3500/20695] - loss: 935344.0219 - output_1_loss: 124042.5938 - output_2_loss: 53807.3516 - output_3_loss: 164373.4844 - output_4_loss: 135753.5938 - output_5_loss: 348085.6250 - output_6_loss: 109280.0469

 [batch:3600/20695] - loss: 918017.2146 - output_1_loss: 124055.9922 - output_2_loss: 52535.9062 - output_3_loss: 160401.0469 - output_4_loss: 132601.0156 - output_5_loss: 340561.5938 - output_6_loss: 107860.3047

 [batch:3700/20695] - loss: 901454.0553 - output_1_loss: 124096.1328 - output_2_loss: 51326.8672 - output_3_loss: 156607.0312 - output_4_loss: 129584.2578 - output_5_loss: 333345.5312 - output_6_loss: 106493.0781

 [batch:3750/20695] - loss: 893397.4667 - output_1_loss: 124108.4531 - output_2_loss: 50735.1992 - output_3_loss: 154768.8281 - output_4_loss: 128128.1953 - output_5_loss: 329850.2188 - output_6_loss: 105805.6250

# total loss increases after batch=3750. WHY???

 [batch:3800/20695] - loss: 1044768.7239 - output_1_loss: 123897.2188 - output_2_loss: 101063.2812 - output_3_loss: 200778.2812 - output_4_loss: 141584.6875 - output_5_loss: 370377.5000 - output_6_loss: 107066.7812

 [batch:3900/20695] - loss: 1479362.4735 - output_1_loss: 123050.9766 - output_2_loss: 200276.5156 - output_3_loss: 356414.2188 - output_4_loss: 185420.0781 - output_5_loss: 502506.7500 - output_6_loss: 111692.8750 
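For reference, outputs 2-6 above are style losses computed from Gram matrices of VGG feature maps, as in the perceptual-loss approach. A minimal sketch of that computation in PyTorch (an illustration, not the exact code from the notebook):

    import torch

    def gram_matrix(features: torch.Tensor) -> torch.Tensor:
        # features: (batch, channels, height, width) activation map from VGG
        b, c, h, w = features.size()
        flat = features.reshape(b, c, h * w)
        # batched (c x c) Gram matrix, normalized by the number of elements
        return torch.bmm(flat, flat.transpose(1, 2)) / (c * h * w)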

I can't begin to think of how to debug this problem. Once it "works", shouldn't the model continue to work? It seems like some kind of buffer overflow, but I have no idea how to find it. Any ideas?
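One way to localize a failure like this (a hedged sketch, assuming a standard PyTorch training loop; `loader`, `train_step`, and the 1.5x threshold are hypothetical names and choices, not from the notebook) is to watch for a sudden jump in the total loss and dump the offending batch to disk for inspection:

    import torch

    prev_loss = None
    for batch_idx, images in enumerate(loader):   # loader yields image batches
        loss = train_step(images)                 # your training step; returns a float
        if prev_loss is not None and loss > 1.5 * prev_loss:
            # the loss jumped sharply: save the suspicious batch for inspection
            torch.save(images, f"bad_batch_{batch_idx}.pt")
            print(f"loss jumped at batch {batch_idx}: {prev_loss:.1f} -> {loss:.1f}")
        prev_loss = loss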

The full colab notebook/repo can be found here: https://colab.research.google.com/github/mixuala/fast_neural_style_pytorch/blob/master/notebook/%5BSO%5D_Coco14_FastStyleTransfer.ipynb

Recommended answer

I found a saturated white image (RGB=255) that caused the model to become unstable. It appeared in batch=3696, with batch_size=4. When I skipped that batch, everything worked fine.
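Since a single all-white image destabilized training, one option is to filter out (nearly) constant images before they reach the model. A minimal sketch, assuming image batches are torch tensors; the helper name and tolerance are my own, not from the repo:

    import torch

    def has_degenerate_image(batch: torch.Tensor, tol: float = 1e-3) -> bool:
        # batch: (batch, channels, height, width)
        # a per-image standard deviation near zero means the image is
        # (almost) constant, e.g. fully saturated white at RGB=255
        stds = batch.flatten(start_dim=1).std(dim=1)
        return bool((stds < tol).any())

    # inside the training loop:
    # if has_degenerate_image(images):
    #     continue  # skip the batch, as done above for batch 3696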

I know that there was some monitoring code that hit a divide-by-zero error when trying to normalize the domain of the image, but I'm not sure whether that error is connected to the model destabilization. The generated image from the model was all black.
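If that monitoring code rescales each image by its own min/max, a constant image makes the range zero and triggers exactly that divide-by-zero. A guarded version (a sketch with my own epsilon choice, not the notebook's code):

    import torch

    def normalize_domain(img: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # rescale to [0, 1]; eps keeps the denominator non-zero when the
        # image is constant (min == max), e.g. the all-white RGB=255 image
        lo, hi = img.min(), img.max()
        return (img - lo) / (hi - lo + eps)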

