What is "batch normalization"? Why use it? How does it affect prediction?

Problem description

Recently, many deep architectures use "batch normalization" for training.

What is "batch normalization"? What does it do mathematically? In what way does it help the training process?

How is batch normalization used during training? Is it a special layer inserted into the model? Do I need to normalize before each layer, or only once?

Suppose I used batch normalization for training. Does this affect my test-time model? Should I replace the batch normalization with some other/equivalent layer/operation in my "deploy" network?

This question about batch normalization only covers part of what I'm asking; I was hoping for a more detailed answer. More specifically, I would like to know how training with batch normalization affects test-time prediction, i.e., the "deploy" network and the TEST phase of the net.

Answer

Batch normalization is for layers that can suffer from deleterious drift. The math is simple: find the mean and variance of each component, then apply the standard transformation to convert all values to the corresponding Z-scores: subtract the mean and divide by the standard deviation. This ensures that the component ranges are very similar, so that each component has a chance to affect the training deltas (in back-prop).
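A minimal sketch of that per-component Z-scoring, in plain NumPy (names here are hypothetical; real framework layers typically also add a learnable scale and shift on top of this):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Z-score each component over the batch: x has shape (batch, features)."""
    mean = x.mean(axis=0)                    # per-component mean over the batch
    var = x.var(axis=0)                      # per-component variance over the batch
    return (x - mean) / np.sqrt(var + eps)   # eps guards against division by zero

# Two components on very different scales end up in comparable ranges.
x = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])
print(batch_norm(x))  # each column now has mean 0 and roughly unit variance
```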

If you're using the network for pure testing (no further training), then simply delete these layers; they've done their job. If you're training while testing / predicting / classifying, then leave them in place; the operations won't harm your results at all, and barely slow down the forward computations.

As for Caffe specifics, there's really nothing particular to Caffe. The computation is a basic stats process, and is the same algebra for any framework. Granted, there will be some optimizations for hardware that supports vector and matrix math, but those consist of simply taking advantage of the chip's built-in operations.

Response to comments

If you can afford a little extra training time, yes, you'd want to normalize at every layer. In practice, inserting them less frequently -- say, every 1-3 Inception modules -- will work just fine.
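To make "normalize at every layer" concrete, here is a toy forward pass sketched in plain NumPy (all names hypothetical; a real network would use its framework's batch-norm layer instead):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

# A toy 3-layer MLP: normalize after each affine map, before the nonlinearity,
# so no layer's activations drift to a range that swamps the others.
def forward(x, weights):
    h = x
    for W in weights:
        h = relu(batch_norm(h @ W))   # affine -> batch norm -> activation
    return h

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 8)) for _ in range(3)]
batch = rng.normal(size=(4, 8))       # a batch of 4 examples, 8 features each
print(forward(batch, weights).shape)  # (4, 8)
```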

You can ignore these in deployment because they've already done their job: when there's no back-propagation, there's no drift of weights. Also, when the model handles only one instance in each batch, the Z-score is always 0: every input is exactly the mean of the batch (being the entire batch).
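A quick illustration of the batch-of-one point, reusing the same hypothetical batch_norm from above: with a single instance, each value is its own batch mean, so every Z-score collapses to zero.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

single = np.array([[4.2, -1.7, 3.0]])  # a "batch" containing one instance
print(batch_norm(single))              # [[0. 0. 0.]] -- every input equals the batch mean
```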
