Why does my training loss have regular spikes?

Problem Description

I'm training the Keras object detection model linked at the bottom of this question, although I believe my problem has to do neither with Keras nor with the specific model I'm trying to train (SSD), but rather with the way the data is passed to the model during training.

Here is my problem (see image below): My training loss is decreasing overall, but it shows sharp regular spikes:

The unit on the x-axis is not training epochs, but tens of training steps. The spikes occur precisely once every 1390 training steps, which is exactly the number of training steps for one full pass over my training dataset.

The fact that the spikes always occur after each full pass over the training dataset makes me suspect that the problem is not with the model itself, but with the data it is being fed during the training.

I'm using the batch generator provided in the repository to generate batches during training. I checked the source code of the generator and it does shuffle the training dataset before each pass using sklearn.utils.shuffle.
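For reference, here is a minimal sketch of what such a shuffle-per-pass generator typically looks like; the variable names and structure are illustrative, not the repository's actual code:

    from sklearn.utils import shuffle

    def batch_generator(images, labels, batch_size):
        while True:
            # Shuffle the whole dataset at the start of every full pass.
            images, labels = shuffle(images, labels)
            for i in range(0, len(images), batch_size):
                yield images[i:i + batch_size], labels[i:i + batch_size]

Note that the last slice of each pass can contain fewer than batch_size samples; that detail turns out to matter in the answer below.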

I'm confused for two reasons:

  1. The training dataset is being shuffled before each pass.
  2. As you can see in this Jupyter notebook, I'm using the generator's ad-hoc data augmentation features, so the dataset should theoretically never be the same for any pass: all the augmentations are random.

I made some test predictions to see if the model is actually learning anything, and it is! The predictions get better over time, but of course the model is learning very slowly since those spikes seem to mess up the gradient every 1390 steps.

Any hints as to what this might be are greatly appreciated! I'm using the exact same Jupyter notebook that is linked above for my training, the only variable I changed is the batch size from 32 to 16. Other than that, the linked notebook contains the exact training process I'm following.
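As a quick sanity check on those numbers (the dataset size is inferred from the spike period, not stated explicitly): 1390 steps per pass at a batch size of 16 implies roughly 1390 × 16 = 22,240 training samples, consistent with ceil(n_samples / batch_size) steps per epoch:

    import math

    batch_size = 16
    steps_per_epoch = 1390                    # observed spike period
    n_samples = steps_per_epoch * batch_size  # about 22,240 samples (inferred)

    assert math.ceil(n_samples / batch_size) == steps_per_epoch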

Here is a link to the repository that contains the model:

https://github.com/pierluigiferrari/ssd_keras

Solution

I've figured it out myself:

TL;DR:

Make sure your loss magnitude is independent of your mini-batch size.

The long explanation:

In my case the issue was Keras-specific after all.

Maybe the solution to this problem will be useful for someone at some point.

It turns out that Keras divides the loss by the mini-batch size. The important thing to understand here is that it's not the loss function itself that averages over the batch size, but rather the averaging happens somewhere else in the training process.
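To make that concrete, here is a minimal sketch using NumPy stand-ins rather than the actual Keras internals (the exact code path varies between Keras versions): a loss function is expected to return one value per sample, and the training loop, not the loss function, performs the averaging over the batch:

    import numpy as np

    def my_loss(y_true, y_pred):
        # A Keras loss function is expected to return ONE loss value
        # per sample, not a batch average.
        return (y_true - y_pred) ** 2

    y_true = np.array([1.0, 2.0, 3.0, 4.0])
    y_pred = np.array([1.5, 2.5, 3.0, 5.0])

    per_sample = my_loss(y_true, y_pred)  # shape: (batch_size,)
    # The training loop then averages, i.e. divides the summed loss
    # by the mini-batch size.
    batch_loss = per_sample.sum() / len(per_sample)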

Why does this matter?

The model I am training, SSD, uses a rather complicated multi-task loss function that does its own averaging (not by the batch size, but by the number of ground truth bounding boxes in the batch). Now if the loss function already divides the loss by some number that is correlated with the batch size, and afterwards Keras divides by the batch size a second time, then all of a sudden the magnitude of the loss value starts to depend on the batch size (to be precise, it becomes inversely proportional to the batch size).
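A small numerical demonstration of that inverse proportionality, with a stand-in for the SSD loss (the "three ground truth boxes per image" ratio is made up for illustration):

    import numpy as np

    def ssd_style_loss(per_box_losses, n_boxes):
        # The loss averages over the number of ground truth boxes,
        # which grows roughly in proportion to the batch size.
        return per_box_losses.sum() / n_boxes

    rng = np.random.default_rng(0)
    for batch_size in (16, 32):
        n_boxes = 3 * batch_size  # ~3 ground truth boxes per image
        loss = ssd_style_loss(rng.uniform(0.5, 1.5, n_boxes), n_boxes)
        # Keras then divides by the batch size a second time:
        print(batch_size, loss / batch_size)

The internally averaged loss is roughly constant across batch sizes, so after the second division the reported value scales like 1/batch_size: halving the batch size roughly doubles the loss.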

Now usually the number of samples in your dataset is not an integer multiple of the batch size you choose, so the very last mini-batch of an epoch (here I implicitly define an epoch as one full pass over the dataset) will end up containing fewer samples than the batch size. If the magnitude of the loss depends on the batch size, this short final batch messes it up, and in turn messes up the magnitude of the gradient. Since I'm using an optimizer with momentum, that messed-up gradient continues to influence the gradients of a few subsequent training steps, too.
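A concrete illustration with a hypothetical sample count (chosen only so that it is not a multiple of 16):

    n_samples = 22233                    # hypothetical dataset size
    batch_size = 16

    last_batch = n_samples % batch_size  # 9 samples in the final batch
    # With the loss magnitude inversely proportional to the actual
    # batch size, the last step of each epoch is inflated by roughly:
    print(batch_size / last_batch)       # ~1.78x

That periodic inflation is exactly a spike once per pass over the dataset.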

Once I adjusted the loss function by multiplying the loss by the batch size (thus reverting Keras' subsequent division by the batch size), everything was fine: No more spikes in the loss.
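For illustration, a sketch of that kind of fix inside a custom loss written against the TensorFlow backend; the loss body here is a stand-in, not the repository's actual multi-task loss:

    import tensorflow as tf

    def compute_loss(y_true, y_pred):
        # Stand-in: average some loss by the number of ground truth
        # boxes, mimicking the SSD loss's internal normalization.
        n_boxes = tf.maximum(1.0, tf.reduce_sum(y_true[..., 0]))
        total_loss = tf.reduce_sum(tf.abs(y_true - y_pred)) / n_boxes

        # Multiply by the batch size to cancel Keras' subsequent
        # division, making the loss magnitude independent of how many
        # samples the mini-batch actually contains.
        batch_size = tf.cast(tf.shape(y_pred)[0], tf.float32)
        return total_loss * batch_size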
