LSTM 'recurrent_dropout' with 'relu' yields NaNs


Problem Description


Any non-zero recurrent_dropout yields NaN losses and weights; latter are either 0 or NaN. Happens for stacked, shallow, stateful, return_sequences = any, with & w/o Bidirectional(), activation='relu', loss='binary_crossentropy'. NaNs occur within a few batches.

Any fixes? Help's appreciated.


TROUBLESHOOTING ATTEMPTED:

  • recurrent_dropout=0.2,0.1,0.01,1e-6
  • kernel_constraint=maxnorm(0.5,axis=0)
  • recurrent_constraint=maxnorm(0.5,axis=0)
  • clipnorm=50 (empirically determined), Nadam optimizer
  • activation='tanh' - no NaNs, weights stable, tested for up to 10 batches
  • lr=2e-6,2e-5 - no NaNs, weights stable, tested for up to 10 batches
  • lr=5e-5 - no NaNs, weights stable, for 3 batches - NaNs on batch 4
  • batch_shape=(32,48,16) - large loss for 2 batches, NaNs on batch 3

NOTE: batch_shape=(32,672,16), 17 calls to train_on_batch per batch


ENVIRONMENT:

  • Keras 2.2.4 (TensorFlow backend), Python 3.7, Spyder 3.3.7 via Anaconda
  • GTX 1070 6GB, i7-7700HQ, 12GB RAM, Win-10.0.17134 x64
  • CuDNN 10+, latest Nvidia drivers

ADDITIONAL INFO:

Model divergence is spontaneous, occurring at different train updates even with fixed seeds - Numpy, Random, and TensorFlow random seeds. Furthermore, when first diverging, LSTM layer weights are all normal - only going to NaN later.

Below are, in order: (1) inputs to LSTM; (2) LSTM outputs; (3) Dense(1,'sigmoid') outputs -- the three are consecutive, with Dropout(0.5) between each. Preceding (1) are Conv1D layers. Right: LSTM weights. "BEFORE" = 1 train update before; "AFTER" = 1 train update after.

BEFORE divergence:

AT divergence:

## LSTM outputs, flattened, stats
(mean,std)        = (inf,nan)
(min,max)         = (0.00e+00,inf)
(abs_min,abs_max) = (0.00e+00,inf)

AFTER divergence:

## Recurrent Gates Weights:
array([[nan, nan, nan, ..., nan, nan, nan],
       [ 0.,  0., -0., ..., -0.,  0.,  0.],
       [ 0., -0., -0., ..., -0.,  0.,  0.],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [ 0.,  0., -0., ..., -0.,  0., -0.],
       [ 0.,  0., -0., ..., -0.,  0.,  0.]], dtype=float32)
## Dense Sigmoid Outputs:
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)


MINIMAL REPRODUCIBLE EXAMPLE:

from keras.layers import Input, Dense, LSTM, Dropout
from keras.models import Model
from keras.optimizers import Nadam
from keras.constraints import MaxNorm as maxnorm
import numpy as np

ipt = Input(batch_shape=(32, 672, 16))
x = LSTM(512, activation='relu', return_sequences=False,
         recurrent_dropout=0.3,
         kernel_constraint=maxnorm(0.5, axis=0),
         recurrent_constraint=maxnorm(0.5, axis=0))(ipt)
out = Dense(1, activation='sigmoid')(x)

model = Model(ipt, out)
optimizer = Nadam(lr=4e-4, clipnorm=1)
model.compile(optimizer=optimizer, loss='binary_crossentropy')

for train_update, _ in enumerate(range(100)):
    x = np.random.randn(32, 672, 16)       # random inputs matching batch_shape
    y = np.array([1] * 5 + [0] * 27)       # 5 positive, 27 negative labels
    np.random.shuffle(y)
    loss = model.train_on_batch(x, y)
    print(train_update + 1, loss, np.sum(y))

Observations: the following speed up divergence:

  • Higher units (LSTM)
  • Higher # of layers (LSTM)
  • Higher lr << no divergence when <=1e-4, tested up to 400 trains
  • Fewer '1' labels << no divergence with y below, even with lr=1e-3; tested up to 400 trains

y = np.random.randint(0,2,32) # makes more '1' labels


UPDATE: not fixed in TF2; reproducible also using from tensorflow.keras imports.

Solution

Studying the LSTM formulae more deeply and digging into the source code, everything becomes crystal clear.

Verdict: recurrent_dropout has nothing to do with it; something is being looped back where no one expects it.


Actual culprit: the activation argument, here 'relu', is applied to the recurrent transformations - contrary to virtually every tutorial that shows it as the harmless 'tanh'.

I.e., activation is not only for the hidden-to-output transform - source code; it operates directly on computing both recurrent states, cell and hidden:

c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1_c, self.recurrent_kernel_c))
h = o * self.activation(c)
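
For intuition on why this blows up over 672 timesteps, here is a deliberately simplified NumPy sketch (my illustration, not the Keras LSTM equations): it iterates a toy recurrence h_t = activation(W_rec·h_{t-1} + x_t) and compares 'tanh', which caps the state at 1, against 'relu', which leaves the positive feedback unbounded.

import numpy as np

def peak_state(activation, timesteps=672, units=512, seed=0):
    # Toy recurrence h_t = act(W_rec @ h_{t-1} + x_t) -- a simplification,
    # NOT the LSTM cell equations; it only shows bounded vs. unbounded feedback.
    rng = np.random.RandomState(seed)
    W_rec = 0.1 * rng.randn(units, units)        # fixed random "recurrent" weights
    h = np.zeros(units)
    peak = 0.0
    for _ in range(timesteps):
        pre = W_rec @ h + 0.5 * rng.randn(units) # recurrent term + input drive
        h = np.tanh(pre) if activation == 'tanh' else np.maximum(pre, 0.0)
        if not np.isfinite(h).all():             # overflow: report divergence
            return float('inf')
        peak = max(peak, float(np.abs(h).max()))
    return peak

print('tanh peak |h|:', peak_state('tanh'))   # stays <= 1: tanh saturates
print('relu peak |h|:', peak_state('relu'))   # explodes to astronomically large values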


Solution(s):

  • Apply BatchNormalization to LSTM's inputs, especially if previous layer's outputs are unbounded (ReLU, ELU, etc)
    • If previous layer's activations are tightly bounded (e.g. tanh, sigmoid), apply BN before activations (use activation=None, then BN, then Activation layer)
  • Use activation='selu'; more stable, but can still diverge
  • Use lower lr
  • Apply gradient clipping
  • Use fewer timesteps
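
As a rough illustration (not a definitive fix), the reproducible example above could be adjusted along these lines; the particular choices of BatchNormalization on the inputs, 'selu', lr=1e-4, and clipnorm=1 are assumptions made for the sketch.

from keras.layers import Input, Dense, LSTM, BatchNormalization
from keras.models import Model
from keras.optimizers import Nadam

ipt = Input(batch_shape=(32, 672, 16))
x = BatchNormalization()(ipt)                  # normalize the LSTM's inputs
x = LSTM(512, activation='selu',               # more stable than 'relu', can still diverge
         return_sequences=False,
         recurrent_dropout=0.3)(x)
out = Dense(1, activation='sigmoid')(x)

model = Model(ipt, out)
model.compile(optimizer=Nadam(lr=1e-4, clipnorm=1),  # lower lr + gradient clipping
              loss='binary_crossentropy')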

More answers, to some remaining questions:

  • Why was recurrent_dropout suspected? Unmeticulous testing setup; only now did I focus on forcing divergence without it. It did, however, sometimes accelerate divergence - which may be explained by it zeroing the non-relu contributions that would otherwise offset multiplicative reinforcement.
  • Why do nonzero mean inputs accelerate divergence? Additive symmetry; nonzero-mean distributions are asymmetric, with one sign dominating - facilitating large pre-activations, hence large ReLUs.
  • Why can training be stable for hundreds of iterations with a low lr? Extreme activations induce large gradients via large error; with a low lr, this means weights adjust to prevent such activations - whereas a high lr jumps too far too quickly.
  • Why do stacked LSTMs diverge faster? In addition to feeding ReLUs to itself, LSTM feeds the next LSTM, which then feeds itself the ReLU'd ReLU's --> fireworks.

UPDATE 1/22/2020: recurrent_dropout may in fact be a contributing factor, as it utilizes inverted dropout, upscaling hidden transformations during training, easing divergent behavior over many timesteps. Git Issue on this here
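
For reference, a tiny NumPy sketch of the inverted-dropout scaling mentioned here (my illustration, not the Keras source): kept units are multiplied by 1/(1 - rate), so with recurrent_dropout=0.3 each surviving hidden unit is upscaled by roughly 1.43 at every timestep, which can compound over a long sequence.

import numpy as np

rate = 0.3
rng = np.random.RandomState(0)
h = np.ones(8)                                    # toy hidden vector
mask = rng.binomial(1, 1.0 - rate, size=h.shape)  # keep each unit with prob 1 - rate
h_dropped = h * mask / (1.0 - rate)               # inverted dropout: survivors upscaled

print(h_dropped)               # kept entries ~1.4286, dropped entries 0.0
print(1.0 / (1.0 - rate))      # per-step upscaling factor ~1.43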
