Is the Keras implementation of dropout correct?


Problem description


The Keras implementation of dropout references this paper.

The following is an excerpt from that paper:


The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2.


The Keras documentation mentions that dropout is only used at train time, and the following line from the Dropout implementation

x = K.in_train_phase(K.dropout(x, level=self.p), x)


seems to indicate that indeed outputs from layers are simply passed along during test time.
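
For reference, here is a rough NumPy paraphrase of that line (a sketch, not the Keras source, assuming K.in_train_phase(a, b) returns a during training and b otherwise, and that K.dropout(x, level) masks and rescales activations, with level being the drop probability as in Dropout(p)):

import numpy as np

def dropout_forward(x, level, training, rng=np.random.default_rng(0)):
    # Sketch of: x = K.in_train_phase(K.dropout(x, level=self.p), x)
    if not training or level == 0.0:
        return x                              # test phase: input passes through unchanged
    keep_prob = 1.0 - level
    mask = rng.binomial(1, keep_prob, size=x.shape)
    # Surviving activations are rescaled by 1 / keep_prob at train time;
    # this is the "inverted dropout" scaling mentioned in Update 1 below.
    return x * mask / keep_prob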


Further, I cannot find code which scales down the weights after training is complete as the paper suggests. My understanding is this scaling step is fundamentally necessary to make dropout work, since it is equivalent to taking the expected output of intermediate layers in an ensemble of "subnetworks." Without it, the computation can no longer be considered sampling from this ensemble of "subnetworks."
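
To illustrate the expectation argument with a toy example (my own sketch, using the paper's convention where p is the probability of keeping a unit): averaging a unit's masked activation over many dropout masks gives roughly p times the original activation, which is exactly what multiplying the outgoing weights by p at test time reproduces, and which inverted dropout instead bakes in by rescaling with 1/p during training.

import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0])      # toy activations of one hidden layer
p = 0.8                                  # paper's convention: probability of keeping a unit

masks = rng.binomial(1, p, size=(200_000, x.size))   # many sampled dropout masks
print((masks * x).mean(axis=0))          # ~ p * x: the expected masked activation
print(p * x)                             # the paper's test-time scaling
print((masks * x / p).mean(axis=0))      # inverted dropout: already ~ x in expectation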


My question, then, is where is this scaling effect of dropout implemented in Keras, if at all?


Update 1: Ok, so Keras uses inverted dropout, though it is called dropout in the Keras documentation and code. The link http://cs231n.github.io/neural-networks-2/#reg doesn't seem to indicate that the two are equivalent. Nor does the answer at https://stats.stackexchange.com/questions/205932/dropout-scaling-the-activation-versus-inverting-the-dropout. I can see that they do similar things, but I have yet to see anyone say they are exactly the same. I think they are not.


So a new question: Are dropout and inverted dropout equivalent? To be clear, I'm looking for mathematical justification for saying they are or aren't.

Answer


Yes, it is implemented properly. Since dropout was invented, people have also improved it from the implementation point of view, and Keras uses one of these techniques. It is called inverted dropout; the cs231n notes linked in the question describe it.
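
To make the terminology concrete, here is a minimal side-by-side sketch of the two schemes for a single dense layer (my own illustration using the paper's convention where p is the keep probability; it is not Keras code):

import numpy as np

def classic_dropout(x, w, p, training, rng):
    # Paper's scheme: mask at train time, multiply the weights by p at test time.
    if training:
        mask = rng.binomial(1, p, size=x.shape)
        return (x * mask) @ w
    return x @ (w * p)                   # test: scaled-down weights

def inverted_dropout(x, w, p, training, rng):
    # Keras-style scheme: mask and rescale by 1/p at train time, leave test time alone.
    if training:
        mask = rng.binomial(1, p, size=x.shape)
        return (x * mask / p) @ w
    return x @ w                         # test: weights and activations untouched

In expectation the two training-time forward passes differ by exactly a factor of 1/p, which is the discrepancy the update below is about.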

Update:


To be honest, in the strict mathematical sense the two approaches are not equivalent. In the inverted case you multiply every hidden activation by the reciprocal of the keep probability, and because differentiation is linear, that is equivalent to multiplying all the gradients by the same factor. To compensate for this difference you would have to use a different learning rate, so from this point of view the two approaches differ. From a practical point of view, however, they are equivalent, because:

  1. If you use a method that sets the learning rate adaptively (like RMSProp or Adagrad), it makes almost no difference to the algorithm.
  2. If you set the learning rate by hand, you have to take into account the stochastic nature of dropout and the fact that some neurons are switched off during the training phase (which does not happen during the test/evaluation phase), so you have to rescale the learning rate to compensate for this difference. Probability theory gives the best rescaling factor: the reciprocal of the keep probability, which makes the expected length of the loss-function gradient the same in the training and the test/evaluation phases (a rough numeric check follows below).


Of course, both of the points above are about the inverted dropout technique.
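
As a rough numeric check of the rescaling factor mentioned in point 2 (my own sketch, with p as the keep probability and the upstream gradient taken to be 1 for simplicity): the expected gradient with respect to a weight is about p times the input activation under plain masking, but about the full activation under the inverted 1/p rescaling, i.e. larger by exactly that reciprocal factor.

import numpy as np

rng = np.random.default_rng(0)
x, p = 2.0, 0.8                          # one input activation and the keep probability
masks = rng.binomial(1, p, size=1_000_000)

# d(loss)/dw for y = mask * x * w versus y = (mask / p) * x * w,
# averaged over many dropout masks (upstream gradient taken as 1)
print((masks * x).mean())                # plain masking:      ~ p * x
print((masks * x / p).mean())            # inverted rescaling: ~ x, i.e. 1/p times larger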

