LSTM 'recurrent_dropout' with 'relu' yields NaNs


Problem Description


Any non-zero recurrent_dropout yields NaN losses and weights; latter are either 0 or NaN. Happens for stacked, shallow, stateful, return_sequences = any, with & w/o Bidirectional(), activation='relu', loss='binary_crossentropy'. NaNs occur within a few batches.

Any fixes? Help's appreciated.


TROUBLESHOOTING ATTEMPTED:

  • recurrent_dropout=0.2,0.1,0.01,1e-6
  • kernel_constraint=maxnorm(0.5,axis=0)
  • recurrent_constraint=maxnorm(0.5,axis=0)
  • clipnorm=50 (empirically determined), Nadam optimizer
  • activation='tanh' - no NaNs, weights stable, tested for up to 10 batches
  • lr=2e-6,2e-5 - no NaNs, weights stable, tested for up to 10 batches
  • lr=5e-5 - no NaNs, weights stable, for 3 batches - NaNs on batch 4
  • batch_shape=(32,48,16) - large loss for 2 batches, NaNs on batch 3

NOTE: batch_shape=(32,672,16), 17 calls to train_on_batch per batch


ENVIRONMENT:

  • Keras 2.2.4 (TensorFlow backend), Python 3.7, Spyder 3.3.7 via Anaconda
  • GTX 1070 6GB, i7-7700HQ, 12GB RAM, Win-10.0.17134 x64
  • CuDNN 10+, latest Nvidia drivers

ADDITIONAL INFO:

Model divergence is spontaneous, occurring at different train updates even with fixed seeds - Numpy, Random, and TensorFlow random seeds. Furthermore, when first diverging, LSTM layer weights are all normal - only going to NaN later.

Below are, in order: (1) inputs to LSTM; (2) LSTM outputs; (3) Dense(1,'sigmoid') outputs -- the three are consecutive, with Dropout(0.5) between each. Preceding (1) are Conv1D layers. Right: LSTM weights. "BEFORE" = 1 train update before; "AFTER" = 1 train update after.

BEFORE divergence:

AT divergence:

## LSTM outputs, flattened, stats
(mean,std)        = (inf,nan)
(min,max)         = (0.00e+00,inf)
(abs_min,abs_max) = (0.00e+00,inf)

AFTER divergence:

## Recurrent Gates Weights:
array([[nan, nan, nan, ..., nan, nan, nan],
       [ 0.,  0., -0., ..., -0.,  0.,  0.],
       [ 0., -0., -0., ..., -0.,  0.,  0.],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [ 0.,  0., -0., ..., -0.,  0., -0.],
       [ 0.,  0., -0., ..., -0.,  0.,  0.]], dtype=float32)
## Dense Sigmoid Outputs:
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)


MINIMAL REPRODUCIBLE EXAMPLE:

from keras.layers import Input, Dense, LSTM, Dropout
from keras.models import Model
from keras.optimizers import Nadam
from keras.constraints import MaxNorm as maxnorm
import numpy as np

ipt = Input(batch_shape=(32, 672, 16))
x = LSTM(512, activation='relu', return_sequences=False,
         recurrent_dropout=0.3,
         kernel_constraint=maxnorm(0.5, axis=0),
         recurrent_constraint=maxnorm(0.5, axis=0))(ipt)
out = Dense(1, activation='sigmoid')(x)

model = Model(ipt, out)
optimizer = Nadam(lr=4e-4, clipnorm=1)
model.compile(optimizer=optimizer, loss='binary_crossentropy')

for train_update, _ in enumerate(range(100)):
    x = np.random.randn(32, 672, 16)       # random inputs matching batch_shape
    y = np.array([1] * 5 + [0] * 27)       # 5 positive, 27 negative labels
    np.random.shuffle(y)
    loss = model.train_on_batch(x, y)
    print(train_update + 1, loss, np.sum(y))

Observations: the following speed up divergence:

  • Higher units (LSTM)
  • Higher # of layers (LSTM)
  • Higher lr << no divergence when <=1e-4, tested up to 400 trains
  • Fewer '1' labels << no divergence with y below, even with lr=1e-3; tested up to 400 trains

y = np.random.randint(0,2,32) # makes more '1' labels


UPDATE: not fixed in TF2; reproducible also using from tensorflow.keras imports.

Solution

Studying the LSTM formulae more deeply and digging into the source code, everything becomes crystal clear.

Verdict: recurrent_dropout has nothing to do with it; something is being looped back where no one expects it.


Actual culprit: the activation argument, here 'relu', is applied to the recurrent transformations - contrary to virtually every tutorial that shows it as the harmless 'tanh'.

I.e., activation is not only for the hidden-to-output transform - source code; it operates directly on computing both recurrent states, cell and hidden:

c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1_c, self.recurrent_kernel_c))
h = o * self.activation(c)
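
For intuition on why this blows up over 672 timesteps, here is a deliberately simplified NumPy sketch (my illustration, not the Keras LSTM equations): it iterates a toy recurrence h_t = activation(W_rec·h_{t-1} + x_t) and compares 'tanh', which caps the state at 1, against 'relu', which leaves the positive feedback unbounded.

import numpy as np

def peak_state(activation, timesteps=672, units=512, seed=0):
    # Toy recurrence h_t = act(W_rec @ h_{t-1} + x_t) -- a simplification,
    # NOT the LSTM cell equations; it only shows bounded vs. unbounded feedback.
    rng = np.random.RandomState(seed)
    W_rec = 0.1 * rng.randn(units, units)        # fixed random "recurrent" weights
    h = np.zeros(units)
    peak = 0.0
    for _ in range(timesteps):
        pre = W_rec @ h + 0.5 * rng.randn(units) # recurrent term + input drive
        h = np.tanh(pre) if activation == 'tanh' else np.maximum(pre, 0.0)
        if not np.isfinite(h).all():             # overflow: report divergence
            return float('inf')
        peak = max(peak, float(np.abs(h).max()))
    return peak

print('tanh peak |h|:', peak_state('tanh'))   # stays <= 1: tanh saturates
print('relu peak |h|:', peak_state('relu'))   # explodes to astronomically large values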


Solution(s):

  • Apply BatchNormalization to LSTM's inputs, especially if previous layer's outputs are unbounded (ReLU, ELU, etc)
    • If previous layer's activations are tightly bounded (e.g. tanh, sigmoid), apply BN before activations (use activation=None, then BN, then Activation layer)
  • Use activation='selu'; more stable, but can still diverge
  • Use lower lr
  • Apply gradient clipping
  • Use fewer timesteps
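
As a rough illustration (not a definitive fix), the reproducible example above could be adjusted along these lines; the particular choices of BatchNormalization on the inputs, 'selu', lr=1e-4, and clipnorm=1 are assumptions made for the sketch.

from keras.layers import Input, Dense, LSTM, BatchNormalization
from keras.models import Model
from keras.optimizers import Nadam

ipt = Input(batch_shape=(32, 672, 16))
x = BatchNormalization()(ipt)                  # normalize the LSTM's inputs
x = LSTM(512, activation='selu',               # more stable than 'relu', can still diverge
         return_sequences=False,
         recurrent_dropout=0.3)(x)
out = Dense(1, activation='sigmoid')(x)

model = Model(ipt, out)
model.compile(optimizer=Nadam(lr=1e-4, clipnorm=1),  # lower lr + gradient clipping
              loss='binary_crossentropy')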

More answers, to some remaining questions:

  • Why was recurrent_dropout suspected? Unmeticulous testing setup; only now did I focus on forcing divergence without it. It did, however, sometimes accelerate divergence - which may be explained by it zeroing the non-relu contributions that would otherwise offset multiplicative reinforcement.
  • Why do nonzero mean inputs accelerate divergence? Additive symmetry; nonzero-mean distributions are asymmetric, with one sign dominating - facilitating large pre-activations, hence large ReLUs.
  • Why can training be stable for hundreds of iterations with a low lr? Extreme activations induce large gradients via large error; with a low lr, this means weights adjust to prevent such activations - whereas a high lr jumps too far too quickly.
  • Why do stacked LSTMs diverge faster? In addition to feeding ReLUs to itself, LSTM feeds the next LSTM, which then feeds itself the ReLU'd ReLU's --> fireworks.

UPDATE 1/22/2020: recurrent_dropout may in fact be a contributing factor, as it utilizes inverted dropout, upscaling hidden transformations during training, easing divergent behavior over many timesteps. Git Issue on this here
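
For reference, a tiny NumPy sketch of the inverted-dropout scaling mentioned here (my illustration, not the Keras source): kept units are multiplied by 1/(1 - rate), so with recurrent_dropout=0.3 each surviving hidden unit is upscaled by roughly 1.43 at every timestep, which can compound over a long sequence.

import numpy as np

rate = 0.3
rng = np.random.RandomState(0)
h = np.ones(8)                                    # toy hidden vector
mask = rng.binomial(1, 1.0 - rate, size=h.shape)  # keep each unit with prob 1 - rate
h_dropped = h * mask / (1.0 - rate)               # inverted dropout: survivors upscaled

print(h_dropped)               # kept entries ~1.4286, dropped entries 0.0
print(1.0 / (1.0 - rate))      # per-step upscaling factor ~1.43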
