LSTM 'recurrent_dropout' with 'relu' yields NaNs


Problem description

Any non-zero recurrent_dropout yields NaN losses and weights; the latter are either 0 or NaN. Happens for stacked, shallow, stateful, return_sequences = any, with & w/o Bidirectional(), activation='relu', loss='binary_crossentropy'. NaNs occur within a few batches.

Any fixes? Help's appreciated.


Troubleshooting attempted:

  • recurrent_dropout=0.2, 0.1, 0.01, 1e-6
  • kernel_constraint=maxnorm(0.5, axis=0)
  • recurrent_constraint=maxnorm(0.5, axis=0)
  • clipnorm=50 (empirically determined), Nadam optimizer
  • activation='tanh' - no NaNs, weights stable, tested for up to 10 batches
  • lr=2e-6, 2e-5 - no NaNs, weights stable, tested for up to 10 batches
  • lr=5e-5 - no NaNs, weights stable for 3 batches - NaNs on batch 4
  • batch_shape=(32, 48, 16) - large loss for 2 batches, NaNs on batch 3

NOTE: batch_shape=(32, 672, 16), 17 calls to train_on_batch per batch


Environment:

  • Keras 2.2.4 (TensorFlow backend), Python 3.7, Spyder 3.3.7 (via Anaconda)
  • GTX 1070 6GB, i7-7700HQ, 12GB RAM, Win-10.0.17134 x64
  • CuDNN 10+, latest Nvidia drivers

Additional info:

Model divergence is spontaneous, occurring at different train updates even with fixed seeds - Numpy, Random, and TensorFlow random seeds. Furthermore, when first diverging, LSTM layer weights are all normal - only going to NaN later.
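For reference, a minimal sketch of fixing all three seed sources mentioned above (the seed value and the TF1-style call are assumptions based on the Keras 2.2.4 / TF1 environment listed above):

import random
import numpy as np
import tensorflow as tf

SEED = 0                    # placeholder value, not from the original post
random.seed(SEED)           # Python's built-in RNG
np.random.seed(SEED)        # NumPy RNG
tf.set_random_seed(SEED)    # TF1 graph-level seed; on TF2 use tf.random.set_seed(SEED)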

Below are, in order: (1) inputs to LSTM; (2) LSTM outputs; (3) Dense(1,'sigmoid') outputs -- the three are consecutive, with Dropout(0.5) between each. Preceding (1) are Conv1D layers. Right: LSTM weights. "BEFORE" = 1 train update before; "AFTER" = 1 train update after.

BEFORE divergence:

AT divergence:

## LSTM outputs, flattened, stats
(mean,std)        = (inf,nan)
(min,max)         = (0.00e+00,inf)
(abs_min,abs_max) = (0.00e+00,inf)

AFTER divergence:

## Recurrent Gates Weights:
array([[nan, nan, nan, ..., nan, nan, nan],
       [ 0.,  0., -0., ..., -0.,  0.,  0.],
       [ 0., -0., -0., ..., -0.,  0.,  0.],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [ 0.,  0., -0., ..., -0.,  0., -0.],
       [ 0.,  0., -0., ..., -0.,  0.,  0.]], dtype=float32)
## Dense Sigmoid Outputs:
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)
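Stats and weights like those shown above can be pulled with something along these lines (a sketch; the layer index 1 assumes the Input → LSTM → Dense model from the minimal example below - adjust it for deeper models - and K.function / get_weights are used to grab intermediate outputs and the recurrent kernel):

import numpy as np
from keras import backend as K

def lstm_output_stats(model, x):
    # Map model input -> LSTM layer output (layer index is an assumption)
    get_lstm_out = K.function([model.input], [model.layers[1].output])
    out = get_lstm_out([x])[0].flatten()
    print("(mean,std)        = (%.2e,%.2e)" % (out.mean(), out.std()))
    print("(min,max)         = (%.2e,%.2e)" % (out.min(), out.max()))
    print("(abs_min,abs_max) = (%.2e,%.2e)" % (np.abs(out).min(), np.abs(out).max()))

def lstm_recurrent_kernel(model):
    # Keras LSTM weights are ordered [kernel, recurrent_kernel, bias]
    return model.layers[1].get_weights()[1]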


Minimal reproducible example:

from keras.layers import Input,Dense,LSTM,Dropout
from keras.models import Model
from keras.optimizers import Nadam
from keras.constraints import MaxNorm as maxnorm
import numpy as np

ipt = Input(batch_shape=(32,672,16))
x = LSTM(512, activation='relu', return_sequences=False,
              recurrent_dropout=0.3,
              kernel_constraint   =maxnorm(0.5, axis=0),
              recurrent_constraint=maxnorm(0.5, axis=0))(ipt)
out = Dense(1, activation='sigmoid')(x)

model = Model(ipt,out)
optimizer = Nadam(lr=4e-4, clipnorm=1)
model.compile(optimizer=optimizer,loss='binary_crossentropy')

for train_update,_ in enumerate(range(100)):
    x = np.random.randn(32,672,16)
    y = np.array([1]*5 + [0]*27)
    np.random.shuffle(y)
    loss = model.train_on_batch(x,y)
    print(train_update+1,loss,np.sum(y))
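To pinpoint the exact update at which divergence occurs, the loop above can be extended with a simple finiteness check (a sketch reusing the model from the example above):

for train_update in range(100):
    x = np.random.randn(32, 672, 16)
    y = np.array([1]*5 + [0]*27)
    np.random.shuffle(y)
    loss = model.train_on_batch(x, y)
    # Stop as soon as the loss or any weight becomes non-finite (NaN or inf)
    weights_ok = all(np.all(np.isfinite(w)) for w in model.get_weights())
    if not np.isfinite(loss) or not weights_ok:
        print("diverged at update", train_update + 1)
        break
    print(train_update + 1, loss, np.sum(y))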

Observations: the following speed up divergence:

  • Higher units (LSTM)
  • Higher # of layers (LSTM)
  • Higher lr << no divergence when <=1e-4, tested up to 400 trains
  • Less '1' labels << no divergence with y below, even with lr=1e-3; tested up to 400 trains

y = np.random.randint(0,2,32) # makes more '1' labels

UPDATE: not fixed in TF2; reproducible also using from tensorflow.keras imports.
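For that tensorflow.keras check, presumably only the imports change along these lines (a sketch; the rest of the minimal example stays the same):

from tensorflow.keras.layers import Input, Dense, LSTM, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Nadam
from tensorflow.keras.constraints import MaxNorm as maxnorm
import numpy as np
# Note: recent TF2 releases prefer Nadam(learning_rate=...) over Nadam(lr=...)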

Recommended answer

Studying the LSTM formulae deeper and digging into the source code, everything comes crystal clear - and if it isn't to you just from reading the question, then you have something to learn from this answer.

Verdict: recurrent_dropout has nothing to do with it; a thing is being looped where none expect it.

Actual culprit: the activation argument, here 'relu', is applied to the recurrent transformations - contrary to virtually every tutorial that shows it as the harmless 'tanh'.

I.e., activation is not only for the hidden-to-output transform (see the source code); it operates directly in computing both recurrent states, cell and hidden:

c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1_c, self.recurrent_kernel_c))
h = o * self.activation(c)
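To see why that matters: with the default 'tanh' the recurrent state is squashed into [-1, 1] at every timestep, whereas 'relu' is unbounded, so the state can grow multiplicatively across the 672 timesteps until it overflows and losses/weights turn NaN. A toy NumPy recurrence (not the actual Keras code path; the weight scale is arbitrary) makes the difference visible:

import numpy as np

np.random.seed(0)
units = 512
W = np.random.randn(units, units) * 0.1    # arbitrary stand-in for a recurrent kernel

def simulate(activation, steps=672):
    # Simplified recurrence h_t = activation(W @ h_{t-1}); ignores gates, inputs, biases
    h = np.random.randn(units) * 0.01
    for _ in range(steps):
        h = activation(W @ h)
    return np.abs(h).max()

print("tanh:", simulate(np.tanh))                      # stays within [-1, 1]
print("relu:", simulate(lambda z: np.maximum(z, 0)))   # grows by ~100+ orders of magnitude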

Solution:
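The question's own troubleshooting already shows that activation='tanh' produces no NaNs, so one mitigation consistent with the analysis above (an assumption here, not necessarily the complete fix) is to keep a bounded activation on the recurrent path:

# Assumed mitigation: bounded activation on the recurrent transform
x = LSTM(512, activation='tanh', return_sequences=False,
         recurrent_dropout=0.3)(ipt)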
