LSTM 'recurrent_dropout' with 'relu' yields NaNs

Published: 2020/4/25 10:03:46. Tags: tensorflow, keras, lstm, numerical-stability


Question

Any non-zero recurrent_dropout yields NaN losses and weights; the latter are either 0 or NaN. This happens for stacked, shallow, and stateful models, for any return_sequences setting, with and without Bidirectional(), using activation='relu' and loss='binary_crossentropy'. NaNs occur within a few batches.

Any fixes? Help's appreciated.

Troubleshooting attempted:

recurrent_dropout=0.2,0.1,0.01,1e-6
kernel_constraint=maxnorm(0.5,axis=0)
recurrent_constraint=maxnorm(0.5,axis=0)
clipnorm=50 (empirically determined), Nadam optimizer
activation='tanh' - no NaNs, weights stable, tested for up to 10 batches
lr=2e-6,2e-5 - no NaNs, weights stable, tested for up to 10 batches
lr=5e-5 - no NaNs, weights stable, for 3 batches - NaNs on batch 4
batch_shape=(32,48,16) - large loss for 2 batches, NaNs on batch 3

NOTE: batch_shape=(32,672,16), 17 calls to train_on_batch per batch

Environment:

Keras 2.2.4 (TensorFlow backend), Python 3.7, Spyder 3.3.7 (via Anaconda)
GTX 1070 6GB, i7-7700HQ, 12GB RAM, Win-10.0.17134 x64
CuDNN 10+, latest Nvidia drivers

Additional info:

Model divergence is spontaneous, occurring at different train updates even with fixed seeds (Numpy, Random, and TensorFlow random seeds). Furthermore, when the model first diverges, the LSTM layer weights are all normal; they only go to NaN later.

Below are, in order: (1) inputs to LSTM; (2) LSTM outputs; (3) Dense(1,'sigmoid') outputs. The three are consecutive, with Dropout(0.5) between each. Preceding (1) are Conv1D layers. Right: LSTM weights. "BEFORE" = 1 train update before; "AFTER" = 1 train update after.

BEFORE divergence:

AT divergence:

## LSTM outputs, flattened, stats
(mean,std)        = (inf,nan)
(min,max)         = (0.00e+00,inf)
(abs_min,abs_max) = (0.00e+00,inf)
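Statistics in the format shown above can be gathered with a small helper. This is a minimal sketch, assuming the layer outputs have already been fetched as a NumPy array; `flat_stats` is a hypothetical name, not the function used in the question.

```python
import numpy as np

def flat_stats(arr):
    """Return ((mean, std), (min, max), (abs_min, abs_max)) of a flattened array."""
    a = np.asarray(arr, dtype=np.float64).ravel()
    # inf entries make the mean inf and the std nan; silence the warnings
    # so the helper can be called on already-diverged outputs.
    with np.errstate(invalid='ignore', over='ignore'):
        mean, std = a.mean(), a.std()
    abs_a = np.abs(a)
    return (mean, std), (a.min(), a.max()), (abs_a.min(), abs_a.max())

# Hypothetical diverged output: one inf entry among non-negative values.
stats = flat_stats([0.0, 1.0, np.inf])
print(stats)
```

A single inf entry is enough to produce the (inf, nan) mean and std pattern, since the deviation from an infinite mean is undefined.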

AFTER divergence:

## Recurrent Gates Weights:
array([[nan, nan, nan, ..., nan, nan, nan],
       [ 0.,  0., -0., ..., -0.,  0.,  0.],
       [ 0., -0., -0., ..., -0.,  0.,  0.],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [ 0.,  0., -0., ..., -0.,  0., -0.],
       [ 0.,  0., -0., ..., -0.,  0.,  0.]], dtype=float32)
## Dense Sigmoid Outputs:
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)

Minimal reproducible example:

from keras.layers import Input, Dense, LSTM
from keras.models import Model
from keras.optimizers import Nadam
from keras.constraints import MaxNorm as maxnorm
import numpy as np

# Single LSTM layer with relu activation and non-zero recurrent_dropout
ipt = Input(batch_shape=(32, 672, 16))
x = LSTM(512, activation='relu', return_sequences=False,
         recurrent_dropout=0.3,
         kernel_constraint=maxnorm(0.5, axis=0),
         recurrent_constraint=maxnorm(0.5, axis=0))(ipt)
out = Dense(1, activation='sigmoid')(x)

model = Model(ipt, out)
optimizer = Nadam(lr=4e-4, clipnorm=1)
model.compile(optimizer=optimizer, loss='binary_crossentropy')

# Train on random inputs with imbalanced labels (5 ones, 27 zeros per batch)
for train_update in range(100):
    x_batch = np.random.randn(32, 672, 16)
    y = np.array([1]*5 + [0]*27)
    np.random.shuffle(y)
    loss = model.train_on_batch(x_batch, y)
    print(train_update + 1, loss, np.sum(y))

Observations: the following speed up divergence:

Higher units (LSTM)
Higher # of layers (LSTM)
Higher lr (no divergence when lr <= 1e-4; tested for up to 400 train updates)
Fewer '1' labels (no divergence with the y below, even with lr=1e-3; tested for up to 400 train updates)

y = np.random.randint(0,2,32) # makes more '1' labels

UPDATE: not fixed in TF2; also reproducible using from tensorflow.keras imports.

Recommended answer

After studying the LSTM formulae more closely and digging into the source code, everything has become crystal clear. If it isn't clear to you just from reading the question, then you have something to learn from this answer.

Verdict: recurrent_dropout has nothing to do with it; a thing is being looped where no one expects it.

Actual culprit: the activation argument, now 'relu', is applied on the recurrent transformations, contrary to virtually every tutorial, which show it as the harmless 'tanh'.

I.e., activation is not only for the hidden-to-output transform (see the source code); it operates directly on computing both recurrent states, cell and hidden:

c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1_c, self.recurrent_kernel_c))
h = o * self.activation(c)
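To make the effect of those two lines concrete, here is a minimal NumPy sketch of the same recurrence, run once with tanh and once with relu. The weights are hand-picked to make the feedback loop visible; they are an illustrative assumption, not values from the question, and the helper names (`lstm_max_abs_h`, `toy_weights`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def lstm_max_abs_h(x, W, U, b, activation):
    """Run a plain LSTM over x (timesteps, input_dim); return the largest |h| seen."""
    units = U.shape[0]
    h = np.zeros(units)
    c = np.zeros(units)
    max_abs_h = 0.0
    for x_t in x:
        z = x_t @ W + h @ U + b
        z_i, z_f, z_c, z_o = np.split(z, 4)       # gate order: i, f, c, o
        i, f, o = sigmoid(z_i), sigmoid(z_f), sigmoid(z_o)
        c = f * c + i * activation(z_c)           # same form as the Keras line for c
        h = o * activation(c)                     # same form as the Keras line for h
        max_abs_h = max(max_abs_h, float(np.abs(h).max()))
    return max_abs_h

def toy_weights(units=2):
    """Hand-picked weights: gates held nearly open, h fed straight back into the candidate."""
    W = np.zeros((1, 4 * units))
    U = np.zeros((units, 4 * units))
    b = np.full(4 * units, 5.0)                   # sigmoid(5) ~ 0.993 for i, f, o
    W[:, 2 * units:3 * units] = 1.0               # input drives the candidate only
    U[:, 2 * units:3 * units] = np.eye(units)     # recurrent feedback into the candidate
    b[2 * units:3 * units] = 0.0
    return W, U, b

x = np.ones((30, 1))                              # 30 timesteps of constant input
W, U, b = toy_weights()
max_tanh = lstm_max_abs_h(x, W, U, b, np.tanh)
max_relu = lstm_max_abs_h(x, W, U, b, relu)
print(max_tanh, max_relu)
```

With tanh, h is a sigmoid gate times a value in [-1, 1], so it can never leave that range. With relu there is no such bound: in this toy setup the cell state roughly doubles every timestep, which over 672 timesteps would overflow to inf.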

Solutions:

Apply BatchNormalization to the LSTM's inputs, especially if the previous layer's outputs are unbounded (ReLU, ELU, etc.)

If the previous layer's activations are tightly bounded (e.g. tanh, sigmoid), apply BN before the activation (use activation=None, then BN, then an Activation layer)
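The two BatchNormalization placements described above might look as follows in tensorflow.keras. This is only a sketch: `build_model`, its parameters, and the Conv1D filter and kernel sizes are hypothetical choices for illustration, and it has not been verified to prevent divergence for the model in the question.

```python
from tensorflow.keras.layers import (Input, Conv1D, BatchNormalization,
                                     Activation, LSTM, Dense)
from tensorflow.keras.models import Model

def build_model(timesteps=672, channels=16, units=32, bounded_prev=False):
    """Conv1D -> BN -> LSTM(relu) -> Dense(sigmoid), with BN placed per the advice above."""
    ipt = Input(shape=(timesteps, channels))
    if bounded_prev:
        # Tightly bounded previous activation: activation=None, then BN, then Activation
        x = Conv1D(16, 8, padding='same', activation=None)(ipt)
        x = BatchNormalization()(x)
        x = Activation('tanh')(x)
    else:
        # Unbounded previous activation (ReLU): normalize its outputs before the LSTM
        x = Conv1D(16, 8, padding='same', activation='relu')(ipt)
        x = BatchNormalization()(x)
    x = LSTM(units, activation='relu')(x)
    out = Dense(1, activation='sigmoid')(x)
    return Model(ipt, out)

model = build_model()
model.compile(optimizer='nadam', loss='binary_crossentropy')
```

In both branches the LSTM receives inputs whose scale is controlled, which limits how large the relu pre-activations can get on the first timesteps.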

More answers, to some remaining questions:

Why was recurrent_dropout suspected? Unmeticulous testing setup; only now did I focus on forcing divergence without it. It did, however, sometimes accelerate divergence, which may be explained by it zeroing the non-relu contributions that would otherwise offset multiplicative reinforcement.

Why do nonzero mean inputs accelerate divergence? Additive symmetry; nonzero-mean distributions are asymmetric, with one sign dominating - facilitating large pre-activations, hence large ReLUs.

Why can training be stable for hundreds of iterations with a low lr? Extreme activations induce large gradients via large error; with a low lr, this means weights adjust to prevent such activations - whereas a high lr jumps too far too quickly.

Why do stacked LSTMs diverge faster? In addition to feeding ReLUs to itself, LSTM feeds the next LSTM, which then feeds itself the ReLU'd ReLU's --> fireworks.

UPDATE 1/22/2020: recurrent_dropout may in fact be a contributing factor, as it utilizes inverted dropout, which upscales hidden transformations during training and so facilitates divergent behavior over many timesteps. Git Issue on this here
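The upscaling mentioned in the update can be seen in a generic NumPy sketch of inverted dropout. This is not Keras's implementation of recurrent_dropout; `inverted_dropout` is a hypothetical helper that only illustrates the scaling.

```python
import numpy as np

def inverted_dropout(h, rate, rng):
    """Zero each entry with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep = rng.random(h.shape) >= rate
    return h * keep / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones(10000)
h_dropped = inverted_dropout(h, rate=0.3, rng=rng)
print(h_dropped.max(), h_dropped.mean())
```

Every surviving entry is multiplied by 1 / (1 - rate), about 1.43 for rate=0.3, so the mean stays close to the original while individual values grow. With an unbounded activation, that extra factor is applied to the hidden state at every timestep.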
