What does fast loss convergence indicate on a CNN?

Problem description

I'm training two CNNs (AlexNet and GoogLeNet) in two different DL libraries (Caffe and Tensorflow). The networks were implemented by the dev teams of each library (here and here).

I reduced the original Imagenet dataset to 1024 images of a single category, but kept 1000 output categories on the networks.

So I trained the CNNs, varying the processing unit (CPU/GPU) and the batch size, and I observed that the loss converges quickly to near zero (in most cases before 1 epoch is completed), as in this graph (AlexNet on Tensorflow):

In Portuguese, 'Épocas' means epochs and 'Perda' means loss. The numbers in the legend refer to the batch sizes.

The weight decay and initial learning rate are the same as those used in the models I downloaded; I only changed the dataset and the batch sizes.
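
For reference, here is a minimal sketch of the same situation with a toy softmax classifier in plain numpy: random vectors stand in for the images, all 1024 of them are labelled with class 0 out of 1000 possible classes, and the per-batch loss falls from ln(1000) ≈ 6.9 toward zero within a single epoch, mirroring the curves above. Everything here (feature size, learning rate, the shared "template" structure of the data) is an arbitrary illustrative choice, not taken from the real models:

    import numpy as np

    rng = np.random.default_rng(0)
    num_samples, num_features, num_classes, batch_size = 1024, 256, 1000, 64

    # Toy stand-in for the reduced dataset: every "image" is one shared
    # template plus noise, and every label is class 0.
    template = rng.normal(size=num_features)
    X = template + 0.5 * rng.normal(size=(num_samples, num_features))
    y = np.zeros(num_samples, dtype=int)

    W = np.zeros((num_features, num_classes))   # linear softmax classifier
    lr = 0.01

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    for step in range(num_samples // batch_size):         # exactly one epoch
        xb = X[step * batch_size:(step + 1) * batch_size]
        yb = y[step * batch_size:(step + 1) * batch_size]
        p = softmax(xb @ W)
        loss = -np.log(p[np.arange(batch_size), yb]).mean()
        grad = p.copy()
        grad[np.arange(batch_size), yb] -= 1.0             # softmax cross-entropy gradient
        W -= lr * xb.T @ grad / batch_size                 # one SGD update
        print(f"step {step:2d}  batch loss {loss:.3f}")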

Why are my networks converging this way, and not like this?

Solution

The loss function is less noisy than usual and does not oscillate for a few reasons.

The main one is that you have only 1 category, so (to simplify a bit) the network easily improves at each step, just by improving the score of that category on all your inputs.

Take a look at the (beautiful!) image below: if you have several classes, a good step for one sample is often a bad one for another sample (because they have different categories), which is why the loss sometimes goes up locally. A network update made on a sample of category 1 is a bad step for all samples of category 2, and conversely, but the sum of the two types of updates goes in the right direction (their bad parts cancel out and only the useful part of the steps remains). If you have 1 class, you'll go straight and fast to "always predict category 1", whereas with 2 or more categories, you'll zigzag and converge slowly to "always predict correctly".
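
You can check that argument on a toy linear classifier (a sketch only, with made-up numbers, not your actual networks): take a gradient step computed from one category-1 sample and watch what it does to the loss of a category-2 sample.

    import numpy as np

    rng = np.random.default_rng(1)
    num_features, num_classes, lr = 8, 2, 0.5
    W = rng.normal(scale=0.1, size=(num_features, num_classes))

    def loss_and_grad(W, x, label):
        """Softmax cross-entropy loss and its gradient w.r.t. W for a single sample."""
        logits = x @ W
        p = np.exp(logits - logits.max())
        p /= p.sum()
        loss = -np.log(p[label])
        p[label] -= 1.0                      # d(loss)/d(logits)
        return loss, np.outer(x, p)          # d(loss)/d(W)

    # Two samples that look alike but belong to different categories.
    x1 = rng.normal(size=num_features)
    x2 = x1 + 0.3 * rng.normal(size=num_features)

    loss1_before, g1 = loss_and_grad(W, x1, label=0)
    loss2_before, _  = loss_and_grad(W, x2, label=1)

    W_after = W - lr * g1                    # a "good step" for sample 1 ...
    loss1_after, _ = loss_and_grad(W_after, x1, label=0)
    loss2_after, _ = loss_and_grad(W_after, x2, label=1)

    print(f"sample of category 1: {loss1_before:.3f} -> {loss1_after:.3f}")   # goes down
    print(f"sample of category 2: {loss2_before:.3f} -> {loss2_after:.3f}")   # ... but goes up here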

There are a few other effects, like the fact that your dataset is relatively small (so it's easier to learn), that you don't test that often, and maybe you have some smoothing (is your loss computed on the whole dataset or on a batch? Usually it's on a batch, which is what gives the usual loss curve its shape).
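
One quick, framework-independent way to see how much of the shape comes from the logging rather than the optimisation (again only an illustrative sketch, here a two-class logistic regression so there is some noise to see) is to record both the loss of the current minibatch and the loss over the whole dataset at every step; the per-batch curve is typically the noisier one, and it is the one training logs usually show.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 512, 16
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = (X @ w_true > 0).astype(float)        # two classes this time

    w = np.zeros(d)
    lr, batch = 0.1, 32

    def loss(w, X, y):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)).mean()

    for step in range(n // batch):
        xb, yb = X[step * batch:(step + 1) * batch], y[step * batch:(step + 1) * batch]
        p = 1.0 / (1.0 + np.exp(-(xb @ w)))
        w -= lr * xb.T @ (p - yb) / batch      # one SGD step on the minibatch
        # The curve most training logs plot is the first number, not the second.
        print(f"batch loss {loss(w, xb, yb):.3f}   full-dataset loss {loss(w, X, y):.3f}")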

The difference between your curves is also normal, but still characteristic of the fact that you have only 1 class actually present in the dataset. First notice that the CPU and the GPU have the same behavior, because they do exactly the same thing, just at different speeds.

When your batch size is > 1, the updates made to the network are the average of all the updates you would have made with the samples taken alone (again simplifying a bit). So you usually get smarter updates (more likely to go in the direction of "always predict correctly"), and therefore you need fewer updates to reach good performance. There is a tradeoff between this faster convergence and the fact that bigger batches use more data for each update, so it's hard to say beforehand which curve should converge faster. It's widely considered that you should use minibatches of size > 1 (but not too big either).

Now, when only 1 class is actually present in the dataset, all updates are roughly in the same direction, "always predict 1", so the minibatch average is basically the same update, it just consumed more data to produce it. Since you still need the same number of these updates, you'll converge after the same number of steps, and so you'll consume more data for the same result.
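
That "same update, more data" point can also be checked on the single-class toy setup from the sketch in the question section (again just an illustration with arbitrary numbers): because every sample pulls in roughly the same direction, the gradient averaged over a batch of 64 samples points almost exactly where the gradient of a single sample already points, so the larger batch spends 64 samples to produce essentially the same update.

    import numpy as np

    rng = np.random.default_rng(3)
    num_features, num_classes, batch_size = 256, 1000, 64

    # Single-class toy data, as before: shared template + noise, all labelled 0.
    template = rng.normal(size=num_features)
    X = template + 0.5 * rng.normal(size=(batch_size, num_features))
    y = np.zeros(batch_size, dtype=int)
    W = np.zeros((num_features, num_classes))

    def grad(W, X, y):
        """Softmax cross-entropy gradient w.r.t. W, averaged over the given samples."""
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(len(y)), y] -= 1.0
        return X.T @ p / len(y)

    g_single = grad(W, X[:1], y[:1]).ravel()   # update from a single sample
    g_batch  = grad(W, X, y).ravel()           # update averaged over 64 samples

    cos = g_single @ g_batch / (np.linalg.norm(g_single) * np.linalg.norm(g_batch))
    print(f"cosine between single-sample and batch update: {cos:.3f}")   # typically ~0.9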
