Instance Normalisation vs Batch Normalisation

This article covers the difference between instance normalisation and batch normalisation: what each one computes and when to use which.

Problem description

I understand that batch normalisation helps with faster training by turning the activations towards a unit Gaussian distribution, thus tackling the vanishing-gradient problem. Batch norm behaves differently at training time (it uses the mean/variance of each batch) and at test time (it uses the finalized running mean/variance from the training phase).
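For concreteness, a minimal NumPy sketch of that train/test asymmetry (the function and variable names are just for illustration):

```python
import numpy as np

def batch_norm(x, running_mean, running_var, training, momentum=0.1, eps=1e-5):
    """Normalize activations x of shape (N, features).

    At training time the current batch statistics are used (and folded into
    the running estimates); at test time the running estimates are used.
    """
    if training:
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        # Update the running statistics for later use at test time.
        running_mean = (1 - momentum) * running_mean + momentum * mean
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        mean, var = running_mean, running_var
    return (x - mean) / np.sqrt(var + eps), running_mean, running_var

# Toy usage: a batch of 4 examples with 8 features, fresh running statistics.
rm, rv = np.zeros(8), np.ones(8)
x = np.random.randn(4, 8)
y_train, rm, rv = batch_norm(x, rm, rv, training=True)
y_test, _, _ = batch_norm(x, rm, rv, training=False)
```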

Instance normalisation, on the other hand, acts as contrast normalisation, as mentioned in this paper: https://arxiv.org/abs/1607.08022 . The authors note that the stylised output images should not depend on the contrast of the input content image, and instance normalisation helps with exactly that.

But then shouldn't we also use instance normalisation for image classification, where the class label should not depend on the contrast of the input image? I have not seen any paper that uses instance normalisation in place of batch normalisation for classification. What is the reason for that? Also, can and should batch and instance normalisation be used together? I am eager to get an intuitive as well as theoretical understanding of when to use which normalisation.

Solution

Definition

Let's begin with the strict definition of both:

Batch normalization
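In the notation of the instance-normalization paper linked above ($x_{tijk}$ is the element at batch index $t$, channel $i$ and spatial position $j,k$; the batch holds $T$ images; the learnable scale/shift is omitted):

$$\mu_i = \frac{1}{HWT}\sum_{t=1}^{T}\sum_{l=1}^{W}\sum_{m=1}^{H} x_{tilm}, \qquad \sigma_i^2 = \frac{1}{HWT}\sum_{t=1}^{T}\sum_{l=1}^{W}\sum_{m=1}^{H} \left(x_{tilm}-\mu_i\right)^2, \qquad y_{tijk} = \frac{x_{tijk}-\mu_i}{\sqrt{\sigma_i^2+\epsilon}}$$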

Instance normalization
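Same notation; the only difference is that the statistics are no longer pooled over the batch index $t$:

$$\mu_{ti} = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H} x_{tilm}, \qquad \sigma_{ti}^2 = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H} \left(x_{tilm}-\mu_{ti}\right)^2, \qquad y_{tijk} = \frac{x_{tijk}-\mu_{ti}}{\sqrt{\sigma_{ti}^2+\epsilon}}$$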

As you can see, they do the same thing, except for the number of input tensors that are normalized jointly. The batch version normalizes all images across the batch and the spatial locations (this is the CNN case; for ordinary non-convolutional layers the axes involved differ); the instance version normalizes each element of the batch independently, i.e., across spatial locations only.

In other words, where batch norm computes one mean and standard deviation (thus making the distribution of the whole layer Gaussian), instance norm computes T of them, one per image, making each individual image's distribution look Gaussian, but not jointly.

A simple analogy: at the data pre-processing step, you can normalize the data on a per-image basis, or normalize it across the whole data set.

Credit: the formulas are from here.
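Concretely, the whole difference is a choice of reduction axes. A minimal NumPy sketch for an (N, C, H, W) activation tensor (again ignoring the learnable scale/shift):

```python
import numpy as np

x = np.random.randn(8, 3, 32, 32)  # (batch N, channels C, height H, width W)
eps = 1e-5

# Batch norm: one mean/var per channel, pooled over the batch AND spatial dims.
bn_mean = x.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, C, 1, 1)
bn_var = x.var(axis=(0, 2, 3), keepdims=True)
x_bn = (x - bn_mean) / np.sqrt(bn_var + eps)

# Instance norm: a separate mean/var for every (image, channel) pair,
# pooled over spatial dims only -- N*C sets of statistics instead of C.
in_mean = x.mean(axis=(2, 3), keepdims=True)     # shape (N, C, 1, 1)
in_var = x.var(axis=(2, 3), keepdims=True)
x_in = (x - in_mean) / np.sqrt(in_var + eps)
```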

Which normalization is better?

The answer depends on the network architecture, in particular on what is done after the normalization layer. Image classification networks usually stack the feature maps together and wire them to an FC layer, which shares weights across the batch (the modern way is to use a CONV layer instead of FC, but the argument still applies).

This is where the distribution nuances start to matter: the same neuron is going to receive input from all images. If the variance across the batch is high, the gradients from the small activations will be completely suppressed by the high activations, which is exactly the problem that batch norm tries to solve. That's why it's entirely possible that per-instance normalization won't improve network convergence at all.
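Testing this empirically is a one-line swap in most frameworks; a hypothetical PyTorch sketch of a conv block whose normalization type is a parameter:

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, norm="batch"):
    # InstanceNorm2d is a drop-in replacement for BatchNorm2d here, so the
    # effect of each on convergence can be compared directly.
    norm_layer = nn.BatchNorm2d(out_ch) if norm == "batch" else nn.InstanceNorm2d(out_ch)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        norm_layer,
        nn.ReLU(inplace=True),
    )
```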

On the other hand, batch normalization adds extra noise to the training, because the result for a particular instance depends on its neighbouring instances. As it turns out, this kind of noise can be either good or bad for the network. This is well explained in the "Weight Normalization" paper by Tim Salimans et al., which names recurrent neural networks and reinforcement-learning DQNs as noise-sensitive applications. I'm not entirely sure, but I think the same noise sensitivity was the main issue in the stylization task, which instance norm tried to fight. It would be interesting to check whether weight norm performs better for that particular task.
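For reference, weight normalization is available as a wrapper in PyTorch, so such a check is cheap to set up; a minimal sketch:

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

# Reparameterizes the conv weight as w = g * v / ||v|| (Salimans & Kingma),
# decoupling its magnitude from its direction; no batch statistics involved.
conv = weight_norm(nn.Conv2d(64, 64, kernel_size=3, padding=1))
```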

Can you combine batch and instance normalization?

Though it would make a valid neural network, there's no practical use for it. Batch-normalization noise either helps the learning process (in which case it's preferable) or hurts it (in which case it's better to omit it). In both cases, leaving the network with one type of normalization is likely to improve performance.
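For completeness, "combining" the two would just mean stacking both normalization layers, which is mechanically valid even though, per the argument above, it has no practical use; a hypothetical sketch:

```python
import torch.nn as nn

# Valid but, per the argument above, not useful: both normalizations stacked.
block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),     # normalizes over batch + spatial dims
    nn.InstanceNorm2d(64),  # then re-normalizes each image on its own
    nn.ReLU(inplace=True),
)
```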

