Is the Keras implementation of dropout correct?


Question

The Keras implementation of dropout references this paper.

The following excerpt is from that paper:

The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2.

The Keras documentation mentions that dropout is only used at train time, and the following line from the Dropout implementation

x = K.in_train_phase(K.dropout(x, level=self.p), x)

seems to indicate that indeed outputs from layers are simply passed along during test time.
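
For illustration, here is a rough stand-alone analogue of that line (a sketch, not the Keras source; `toy_in_train_phase` is a made-up name):

import numpy as np

def toy_in_train_phase(train_output, test_output, training):
    # Rough analogue of K.in_train_phase: return the first argument during
    # training and the second one during testing / inference.
    return train_output if training else test_output

x = np.ones((2, 3))
dropped = x * (np.random.rand(*x.shape) >= 0.5)   # placeholder for K.dropout(x, level=0.5);
                                                  # the real backend op also rescales the kept values
out = toy_in_train_phase(dropped, x, training=False)
assert np.array_equal(out, x)                     # at test time x is passed along unchanged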

Further, I cannot find code which scales down the weights after training is complete as the paper suggests. My understanding is this scaling step is fundamentally necessary to make dropout work, since it is equivalent to taking the expected output of intermediate layers in an ensemble of "subnetworks." Without it, the computation can no longer be considered sampling from this ensemble of "subnetworks."
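
To make that expectation argument concrete, here is the usual calculation (a sketch in the paper's notation, where p is the probability of retaining a unit and the nonlinearity is ignored):

% Pre-activation of a unit whose inputs x_i carry retain masks m_i ~ Bernoulli(p):
\mathbb{E}\Big[\sum_i m_i\, w_i\, x_i\Big] = \sum_i \mathbb{E}[m_i]\, w_i\, x_i = p \sum_i w_i\, x_i
% so multiplying the outgoing weights (equivalently, the unit outputs) by p at
% test time reproduces the expected training-time pre-activation.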

My question, then, is where is this scaling effect of dropout implemented in Keras, if at all?

Update 1: Ok, so Keras uses inverted dropout, though it is called dropout in the Keras documentation and code. The link http://cs231n.github.io/neural-networks-2/#reg doesn't seem to indicate that the two are equivalent. Nor does the answer at https://stats.stackexchange.com/questions/205932/dropout-scaling-the-activation-versus-inverting-the-dropout. I can see that they do similar things, but I have yet to see anyone say they are exactly the same. I think they are not.

So a new question: Are dropout and inverted dropout equivalent? To be clear, I'm looking for mathematical justification for saying they are or aren't.

Answer

Yes, it is implemented properly. Since dropout was invented, people have also improved it from the implementation point of view, and Keras uses one of these techniques. It is called inverted dropout, and you can read about it here.
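
For a side-by-side view, here is a minimal NumPy sketch of the two variants (an illustration, not the Keras source); `rate` denotes the drop probability, i.e. 1 - p in the paper's notation:

import numpy as np

def classic_dropout(x, rate, training):
    # Paper version: drop units at train time, scale by the keep
    # probability (1 - rate) at test time.
    if training:
        return x * (np.random.rand(*x.shape) >= rate)
    return x * (1.0 - rate)

def inverted_dropout(x, rate, training):
    # Keras-style version: drop AND rescale by 1 / (1 - rate) at train time,
    # so test time is a plain pass-through (cf. the K.in_train_phase line above).
    if training:
        return x * (np.random.rand(*x.shape) >= rate) / (1.0 - rate)
    return x

In both variants the test-time output equals the expected train-time output of the same unit; they differ only in where the constant factor is applied.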

Update:

To be honest, in the strict mathematical sense these two approaches are not equivalent. In the inverted case you multiply every hidden activation by the reciprocal of the keep probability. But because that operation is linear, it is equivalent to multiplying all the gradients by the same factor, so you would have to set a different learning rate to compensate. From this point of view the approaches differ. From a practical point of view, however, they are equivalent, because:

  1. If you use a method that sets the learning rate adaptively (such as RMSProp or Adagrad), it changes almost nothing in the algorithm.
  2. If you set the learning rate manually, you have to take into account the stochastic nature of dropout and the fact that some neurons are switched off during the training phase (but not during the test/evaluation phase), so you have to rescale the learning rate to overcome this difference. Probability theory gives us the best rescaling factor: the reciprocal of the keep probability, which makes the expected length of the loss-function gradient the same in the training and test/evaluation phases.

Of course, both points above are about the inverted dropout technique.
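
Spelled out (a sketch with keep probability q = 1 - rate and Bernoulli mask m):

% With keep probability q = 1 - rate and mask m_i ~ Bernoulli(q):
%   classic dropout:   train  h_i = m_i a_i,       test  h_i = q a_i
%   inverted dropout:  train  h_i = m_i a_i / q,   test  h_i = a_i
% In both schemes the test-time value equals the expected training-time value:
\mathbb{E}[m_i a_i] = q\, a_i, \qquad \mathbb{E}\big[m_i a_i / q\big] = a_i
% The two training signals differ only by the constant factor 1/q, which is the
% rescaling that the learning-rate remark above refers to.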
