What is "batch normalization"? Why use it? How does it affect prediction?


Question

Recently, many deep architectures use "batch normalization" for training.

What is "batch normalization"? What does it do mathematically? In what way does it help the training process?

在训练期间如何使用批量归一化?它是插入模型的特殊层吗?我需要在每一层之前标准化,还是只需要标准化一次?

How is batch normalization used during training? Is it a special layer inserted into the model? Do I need to normalize before each layer, or only once?

Suppose I used batch normalization for training. Does this affect my test-time model? Should I replace the batch normalization with some other/equivalent layer/operation in my "deploy" network?

This question about batch normalization covers only part of the issue; I was hoping for a more detailed answer. More specifically, I would like to know how training with batch normalization affects test-time prediction, i.e., the "deploy" network and the TEST phase of the net.

Answer

Batch normalization is for layers that can suffer from deleterious drift. The math is simple: find the mean and variance of each component, then apply the standard transformation to convert all values to the corresponding Z-scores: subtract the mean and divide by the standard deviation. This ensures that the component ranges are very similar, so that each has a chance to affect the training deltas (in back-prop).
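
A minimal NumPy sketch of that transformation (the function name and the small epsilon guard against zero variance are my own additions, not part of the original answer):

    import numpy as np

    def batch_normalize(x, eps=1e-5):
        """Z-score each component (column) of a batch of shape (batch, features)."""
        mean = x.mean(axis=0)                    # per-component mean over the batch
        var = x.var(axis=0)                      # per-component variance over the batch
        return (x - mean) / np.sqrt(var + eps)   # subtract mean, divide by std

    # Two features on wildly different scales become comparable:
    x = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 200.0]])
    print(batch_normalize(x))  # each column now has mean ~0 and std ~1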

If you're using the network for pure testing (no further training), then simply delete these layers; they've done their job. If you're training while testing / predicting / classifying, then leave them in place; the operations won't harm your results at all, and barely slow down the forward computations.
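
Following that advice, a hypothetical deploy-time forward pass could simply gate the normalization on a training flag (the flag, layer_fn, and the overall structure are illustrative, not from any particular framework):

    def forward(x, layer_fn, training=True):
        """Normalize the layer input only while training; skip it at deploy time."""
        if training:
            x = batch_normalize(x)  # z-score step from the sketch above
        return layer_fn(x)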

As for Caffe specifics, there's really nothing particular to Caffe. The computation is a basic stats process, and is the same algebra for any framework. Granted, there will be some optimizations for hardware that supports vector and matrix math, but those consist of simply taking advantage of the chip's built-in operations.

Response to comments

If you can afford a little extra training time, yes, you'd want to normalize at every layer. In practice, inserting them less frequently -- say, every 1-3 inceptions -- will work just fine.

You can ignore these in deployment because they've already done their job: when there's no back-propagation, there's no drift of weights. Also, when the model handles only one instance in each batch, the Z-score is always 0: every input is exactly the mean of the batch (being the entire batch).
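
The batch-of-one case is easy to check numerically with the sketch above (the epsilon term is what keeps the zero-variance division finite):

    x = np.array([[5.0, -3.0, 42.0]])  # a single instance: batch size 1
    print(batch_normalize(x))          # [[0. 0. 0.]] -- each input IS the batch mean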
