LSTM 'recurrent_dropout' with 'relu' yields NaNs
Problem description
Any non-zero recurrent_dropout yields NaN losses and weights; the latter are either 0 or NaN. This happens for stacked, shallow, and stateful models, for any return_sequences, with and without Bidirectional(), with activation='relu' and loss='binary_crossentropy'. NaNs occur within a few batches.
Any fixes? Help's appreciated.
Troubleshooting attempted:
- recurrent_dropout=0.2, 0.1, 0.01, 1e-6
- kernel_constraint=maxnorm(0.5, axis=0)
- recurrent_constraint=maxnorm(0.5, axis=0)
- clipnorm=50 (empirically determined), Nadam optimizer
- activation='tanh' - no NaNs, weights stable, tested for up to 10 batches
- lr=2e-6, 2e-5 - no NaNs, weights stable, tested for up to 10 batches
- lr=5e-5 - no NaNs, weights stable for 3 batches; NaNs on batch 4
- batch_shape=(32,48,16) - large loss for 2 batches, NaNs on batch 3
NOTE: batch_shape=(32,672,16), 17 calls to train_on_batch per batch
Environment:
- Keras 2.2.4 (TensorFlow backend), Python 3.7, Spyder 3.3.7 (via Anaconda)
- GTX 1070 6GB, i7-7700HQ, 12GB RAM, Win-10.0.17134 x64
- CuDNN 10+, latest Nvidia drivers
Additional info:
Model divergence is spontaneous, occurring at different train updates even with fixed seeds (Numpy, Random, and TensorFlow random seeds). Furthermore, when first diverging, the LSTM layer weights are all normal; they only go to NaN later.
Below are, in order: (1) inputs to LSTM; (2) LSTM outputs; (3) Dense(1,'sigmoid') outputs. The three are consecutive, with Dropout(0.5) between each. Preceding (1) are Conv1D layers. Right: LSTM weights. "BEFORE" = 1 train update before; "AFTER" = 1 train update after.
BEFORE divergence:
AT divergence:
## LSTM outputs, flattened, stats
(mean,std) = (inf,nan)
(min,max) = (0.00e+00,inf)
(abs_min,abs_max) = (0.00e+00,inf)
AFTER divergence:
## Recurrent Gates Weights:
array([[nan, nan, nan, ..., nan, nan, nan],
[ 0., 0., -0., ..., -0., 0., 0.],
[ 0., -0., -0., ..., -0., 0., 0.],
...,
[nan, nan, nan, ..., nan, nan, nan],
[ 0., 0., -0., ..., -0., 0., -0.],
[ 0., 0., -0., ..., -0., 0., 0.]], dtype=float32)
## Dense Sigmoid Outputs:
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)
Minimal reproducible example:
from keras.layers import Input, Dense, LSTM
from keras.models import Model
from keras.optimizers import Nadam
from keras.constraints import MaxNorm as maxnorm
import numpy as np

# Single LSTM layer with 'relu' activation and non-zero recurrent_dropout
ipt = Input(batch_shape=(32, 672, 16))
x = LSTM(512, activation='relu', return_sequences=False,
         recurrent_dropout=0.3,
         kernel_constraint=maxnorm(0.5, axis=0),
         recurrent_constraint=maxnorm(0.5, axis=0))(ipt)
out = Dense(1, activation='sigmoid')(x)

model = Model(ipt, out)
optimizer = Nadam(lr=4e-4, clipnorm=1)
model.compile(optimizer=optimizer, loss='binary_crossentropy')

# Train on random inputs with imbalanced labels (5 ones, 27 zeros per batch)
for train_update in range(100):
    x = np.random.randn(32, 672, 16)
    y = np.array([1]*5 + [0]*27)
    np.random.shuffle(y)
    loss = model.train_on_batch(x, y)
    print(train_update + 1, loss, np.sum(y))
Observations: the following speed up divergence:
- Higher units (LSTM)
- Higher # of layers (LSTM)
- Higher lr << no divergence when <=1e-4, tested up to 400 trains
- Fewer '1' labels << no divergence with y below, even with lr=1e-3; tested up to 400 trains
y = np.random.randint(0,2,32) # makes more '1' labels
UPDATE: not fixed in TF2; also reproducible using from tensorflow.keras imports.
Recommended answer
Studying the LSTM formulae more deeply and digging into the source code, everything has become crystal clear. If it isn't clear to you just from reading the question, then you have something to learn from this answer.
Verdict: recurrent_dropout has nothing to do with it; a thing's being looped where none expect it.
Actual culprit: the activation argument, now 'relu', is applied on the recurrent transformations, contrary to virtually every tutorial showing it as the harmless 'tanh'.
I.e., activation is not only for the hidden-to-output transform (see the source code); it operates directly on computing both recurrent states, cell and hidden:
c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1_c, self.recurrent_kernel_c))
h = o * self.activation(c)
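To make the effect of those two lines concrete, here is a minimal pure-Python sketch of a one-unit LSTM that follows the same update rule. The shared weight `w = 1.5`, the constant input `x = 1.0`, and the helper name `run_scalar_lstm` are illustrative assumptions, not values from the question or from Keras.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def run_scalar_lstm(activation, steps=50, x=1.0, w=1.5):
    """Iterate a 1-unit LSTM; every kernel shares the hypothetical weight w."""
    h_tm1, c_tm1 = 0.0, 0.0
    for _ in range(steps):
        z = w * x + w * h_tm1          # same pre-activation for all gates
        i, f, o = sigmoid(z), sigmoid(z), sigmoid(z)
        # Mirrors: c = f * c_tm1 + i * activation(x_c + dot(h_tm1, recurrent_kernel_c))
        c = f * c_tm1 + i * activation(z)
        # Mirrors: h = o * activation(c)
        h = o * activation(c)
        h_tm1, c_tm1 = h, c
    return h, c

h_relu, c_relu = run_scalar_lstm(relu)
h_tanh, c_tanh = run_scalar_lstm(math.tanh)
print("relu:", h_relu, c_relu)   # grows by a roughly constant factor per step
print("tanh:", h_tanh, c_tanh)   # |h| can never exceed 1
```

With 'tanh', the hidden state is bounded regardless of the weights. With 'relu', nothing bounds it once the gates saturate near 1, so over hundreds of timesteps the values can exceed the float32 range.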
Solutions:
- Apply BatchNormalization to the LSTM's inputs, especially if the previous layer's outputs are unbounded (ReLU, ELU, etc.)
  - If the previous layer's activations are tightly bounded (e.g. tanh, sigmoid), apply BN before the activation (use activation=None, then BN, then an Activation layer)
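The two placements of BatchNormalization described above can be sketched as follows. This is only an illustration under assumptions: the preceding layer is taken to be a single Conv1D, and the filter count, kernel size, LSTM width, and the helper name `build_model` are hypothetical choices, not the asker's actual architecture. It shows where BN goes; it does not by itself guarantee NaN-free training.

```python
from tensorflow.keras.layers import (Input, Conv1D, BatchNormalization,
                                     Activation, LSTM, Dense)
from tensorflow.keras.models import Model

def build_model(prev_activation="relu"):
    """Conv1D -> (BN placement depends on activation) -> LSTM(relu) -> Dense."""
    ipt = Input(shape=(672, 16))
    if prev_activation in ("tanh", "sigmoid"):
        # Tightly bounded activation: activation=None, then BN, then Activation
        x = Conv1D(32, 8, activation=None)(ipt)
        x = BatchNormalization()(x)
        x = Activation(prev_activation)(x)
    else:
        # Unbounded activation (ReLU, ELU, ...): normalize the LSTM's inputs
        x = Conv1D(32, 8, activation=prev_activation)(ipt)
        x = BatchNormalization()(x)
    x = LSTM(64, activation="relu")(x)
    out = Dense(1, activation="sigmoid")(x)
    return Model(ipt, out)

model = build_model("relu")
model.compile(optimizer="nadam", loss="binary_crossentropy")
model.summary()
```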
More answers, to some remaining questions:
- Why was recurrent_dropout suspected? A testing setup that was not meticulous; only now did I focus on forcing divergence without it. It did, however, sometimes accelerate divergence, which may be explained by its zeroing the non-relu contributions that would otherwise offset multiplicative reinforcement.
- Why do nonzero-mean inputs accelerate divergence? Additive symmetry; nonzero-mean distributions are asymmetric, with one sign dominating, which facilitates large pre-activations and hence large ReLUs.
- Why can training be stable for hundreds of iterations with a low lr? Extreme activations induce large gradients via large error; with a low lr, this means weights adjust to prevent such activations, whereas a high lr jumps too far too quickly.
- Why do stacked LSTMs diverge faster? In addition to feeding ReLUs to itself, an LSTM feeds the next LSTM, which then feeds itself the ReLU'd ReLUs --> fireworks.
UPDATE 1/22/2020: recurrent_dropout may in fact be a contributing factor, as it utilizes inverted dropout, upscaling hidden transformations during training, which makes divergent behavior easier over many timesteps. There is a Git issue on this.
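For reference, inverted dropout can be sketched in a few lines of plain Python. The function name `inverted_dropout` and the sample values are illustrative assumptions; this is not Keras's implementation, only the scaling rule the update refers to: surviving values are multiplied by 1/(1 - rate), so they are larger during training than at inference.

```python
import random

def inverted_dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1/(1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [v * keep_scale if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)
h = [1.0] * 10
dropped = inverted_dropout(h, rate=0.3, rng=rng)
print(dropped)  # each entry is either 0.0 or 1/0.7
```

With an unbounded activation such as 'relu', this upscaling is applied to the hidden state at every timestep, so it compounds with the growth already present in the recurrence.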