RNN Regularization: Which Component to Regularize?

    Problem Description

    I am building an RNN for classification (there is a softmax layer after the RNN). There are so many options for what to regularize and I am not sure if to just try all of them, would the effect be the same? which components do I regularize for what situation?

    The components being:

    • Kernel weights (layer input)
    • Recurrent weights
    • Bias
    • Activation function (layer output)

    Solution

    Regularizers that'll work best will depend on your specific architecture, data, and problem; as usual, there isn't a single cure-all, but there are dos and (especially) don'ts, as well as systematic means of determining what'll work best - via careful introspection and evaluation.


    How does RNN regularization work?

    Perhaps the best approach to understanding it is information-based. First, see "How does 'learning' work?" and "RNN: Depth vs. Width" below. To understand RNN regularization, one must understand how RNNs handle information and learn, which the referenced sections describe (though not exhaustively). Now to answer the question:

    RNN regularization's goal is any regularization's goal: maximizing information utility and traversal of the test loss function. The specific methods, however, tend to differ substantially for RNNs per their recurrent nature - and some work better than others; see below.


    RNN regularization methods:

    WEIGHT DECAY

    1. General: shrinks the norm ('average') of the weight matrix

      • Linearization, depending on activation; e.g. sigmoid, tanh, but less so relu
      • Gradient boost, depending on activation; e.g. sigmoid, tanh grads flatten out for large activations - linearizing enables neurons to keep learning
    2. Recurrent weights: default recurrent_activation='sigmoid' (Keras LSTM/GRU)

      • Pros: linearizing can help BPTT (remedy vanishing gradient), hence also learning long-term dependencies, as recurrent information utility is increased
      • Cons: linearizing can harm representational power - however, this can be offset by stacking RNNs
    3. Kernel weights: for many-to-one (return_sequences=False), weight decay on them works like weight decay on a typical layer (e.g. Dense). For many-to-many (=True), however, kernel weights operate on every timestep, so pros & cons similar to the above will apply. (A per-component Keras sketch follows this list.)
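
    A minimal Keras sketch of the above, with decay applied separately per component; the layer sizes and decay values are illustrative assumptions, not recommendations:

    ```python
    # Per-component weight decay on a Keras LSTM (TF 2.x); all values
    # and sizes here are illustrative assumptions, not recommendations.
    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    model = tf.keras.Sequential([
        layers.LSTM(
            64,
            kernel_regularizer=regularizers.l2(1e-4),     # input-to-hidden (kernel) weights
            recurrent_regularizer=regularizers.l2(1e-4),  # hidden-to-hidden (recurrent) weights
            bias_regularizer=regularizers.l2(1e-5),       # bias vector
            return_sequences=False,                       # many-to-one
            input_shape=(100, 16),                        # (timesteps, channels) - assumed
        ),
        layers.Dense(10, activation='softmax'),           # softmax classifier, as in the question
    ])
    ```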

    Dropout:

    • Activations (kernel): can benefit, but only if limited; values are usually kept less than 0.2 in practice. Problem: tends to introduce too much noise, and erase important context information, especially in problems w/ limited timesteps.
    • Recurrent activations (recurrent_dropout): the recommended dropout (both types are shown in the sketch below)
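
    Both dropout types are plain constructor arguments in Keras; the rates below follow the starting points suggested in this answer:

    ```python
    # Both dropout types as Keras LSTM constructor arguments.
    from tensorflow.keras import layers

    rnn = layers.LSTM(
        64,
        dropout=0.1,            # dropout on input (kernel) transformations
        recurrent_dropout=0.2,  # dropout on recurrent (hidden-to-hidden) transformations
    )
    ```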

    Batch Normalization:

    • Activations (kernel): worth trying. Can benefit substantially, or not.
    • Recurrent activations: should work better; see Recurrent Batch Normalization. No Keras implementations yet as far as I know, but I may implement it in the future.
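
    A sketch of the first ("worth trying") variant, i.e. BatchNormalization applied to activations between stacked RNN layers; sizes are illustrative:

    ```python
    # BatchNormalization on RNN *outputs* between stacked layers; this
    # normalizes activations, not the hidden-to-hidden transform.
    # Sizes are illustrative assumptions.
    import tensorflow as tf
    from tensorflow.keras import layers

    model = tf.keras.Sequential([
        layers.LSTM(64, return_sequences=True, input_shape=(100, 16)),
        layers.BatchNormalization(),  # normalizes per-channel over batch & timesteps
        layers.LSTM(64),
        layers.Dense(10, activation='softmax'),
    ])
    ```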

    Weight Constraints: set a hard upper bound on weights' l2-norm; a possible alternative to weight decay (sketch below).
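
    In Keras this maps onto the constraint arguments; note that MaxNorm bounds the norm of each unit's incoming weight vector rather than the whole matrix, and the 2.0 bound below is illustrative:

    ```python
    # Hard norm bounds via Keras constraints, as an alternative to
    # (or alongside) weight decay. The 2.0 bound is illustrative.
    from tensorflow.keras import layers
    from tensorflow.keras.constraints import MaxNorm

    rnn = layers.LSTM(
        64,
        kernel_constraint=MaxNorm(2.0),     # bound each unit's input-weight norm
        recurrent_constraint=MaxNorm(2.0),  # bound each unit's recurrent-weight norm
    )
    ```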

    Activity Constraints: don't bother; for most purposes, if you have to manually constrain your outputs, the layer itself is probably learning poorly, and the solution is elsewhere.


    What should I do? Lots of info - so here's some concrete advice:

    1. Weight decay: try 1e-3, 1e-4, see which works better. Do not expect the same value of decay to work for kernel and recurrent_kernel, especially depending on architecture. Check weight shapes - if one is much smaller than the other, apply a smaller decay to the former (a combined sketch follows this list)

    2. Dropout: try 0.1. If you see improvement, try 0.2 - else, scrap it

    3. Recurrent Dropout: start with 0.2. Improvement --> 0.4. Improvement --> 0.5, else 0.3.

    4. Batch Normalization: try. Improvement --> keep it - else, scrap it.
    5. Recurrent Batchnorm: same as 4.
    6. Weight constraints: advisable w/ higher learning rates to prevent exploding gradients - else use higher weight decay
    7. Activity constraints: probably not (see above)
    8. Residual RNNs: introduce significant changes, along with a regularizing effect. See application in IndRNNs
    9. Biases: weight decay and constraints become important upon attaining good backpropagation properties; without them on bias weights but with them on kernel (K) & recurrent kernel (RK) weights, bias weights may grow much faster than the latter two and dominate the transformation - also leading to exploding gradients. I recommend a weight decay / constraint less than or equal to that used on K & RK. Also, with BatchNormalization, you cannot set use_bias=False as an "equivalent"; BN applies to outputs, not hidden-to-hidden transforms.
    10. Zoneout: don't know, never tried, might work - see paper.
    11. Layer Normalization: some report it working better than BN for RNNs - but my application found it otherwise; paper
    12. Data shuffling: a strong regularizer. Also shuffle batch samples (samples within a batch). See relevant info on stateful RNNs
    13. Optimizer: can be an inherent regularizer. Don't have a full explanation, but in my application, Nadam (& NadamW) has stomped every other optimizer - worth trying.
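
    Putting points 1, 6, and 9 together, a hedged sketch (all values and sizes are illustrative assumptions):

    ```python
    # Combining points 1, 6, and 9: separate decay per component, bias decay
    # no larger than that on K & RK, plus norm constraints for higher LRs.
    import tensorflow as tf
    from tensorflow.keras import layers, regularizers
    from tensorflow.keras.constraints import MaxNorm

    model = tf.keras.Sequential([
        layers.LSTM(
            64,
            kernel_regularizer=regularizers.l2(1e-4),
            recurrent_regularizer=regularizers.l2(1e-4),
            bias_regularizer=regularizers.l2(1e-4),  # <= decay used on K & RK (point 9)
            kernel_constraint=MaxNorm(4.0),          # point 6
            recurrent_constraint=MaxNorm(4.0),
            input_shape=(100, 16),
        ),
        layers.Dense(10, activation='softmax'),
    ])

    # Point 1: check weight shapes before picking per-component decay values
    for w in model.layers[0].weights:
        print(w.name, tuple(w.shape))
    ```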

    Introspection: bottom section on 'learning' isn't worth much without this; don't just look at validation performance and call it a day - inspect the effect that adjusting a regularizer has on weights and activations. Evaluate using info toward bottom & relevant theory.

    BONUS: weight decay can be powerful - even more powerful when done right; turns out, adaptive optimizers like Adam can harm its effectiveness, as described in this paper. Solution: use AdamW. My Keras/TensorFlow implementation here.
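
    A usage sketch, assuming a TensorFlow release (2.11+) that ships tf.keras.optimizers.AdamW; the implementation linked above is an alternative for older versions:

    ```python
    # Decoupled weight decay (AdamW), per the linked paper.
    # Values are illustrative assumptions.
    import tensorflow as tf

    optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4)
    # model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
    ```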


    This is too much! Agreed - welcome to Deep Learning. Two tips here:

    1. Bayesian Optimization: will save you time, especially on prohibitively expensive training.
    2. Conv1D(strides > 1), for many timesteps (>1000); slashes dimensionality, shouldn't harm performance (may in fact improve it) - see the sketch below.
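
    A sketch of tip 2; the kernel size, stride, and sequence length are illustrative assumptions:

    ```python
    # Strided Conv1D to downsample long sequences before the RNN;
    # strides=4 cuts ~4000 timesteps down to ~1000.
    import tensorflow as tf
    from tensorflow.keras import layers

    model = tf.keras.Sequential([
        layers.Conv1D(32, kernel_size=8, strides=4, activation='relu',
                      input_shape=(4000, 16)),
        layers.LSTM(64),
        layers.Dense(10, activation='softmax'),
    ])
    ```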


    Introspection Code:

    Gradients: see this answer

    Weights: see this answer

    Weight norm tracking: see this Q & A

    Activations: see this answer

    Weights: see_rnn.rnn_histogram or see_rnn.rnn_heatmap (examples in README)
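
    Since the linked answers aren't reproduced here, below is a minimal stand-in for weight norm tracking; WeightNormLogger is a hypothetical helper, assuming the tracked layer is an LSTM/GRU with use_bias=True:

    ```python
    # Logs the l2-norm of each weight matrix of one RNN layer per epoch.
    import numpy as np
    import tensorflow as tf

    class WeightNormLogger(tf.keras.callbacks.Callback):
        def __init__(self, layer_index=0):
            super().__init__()
            self.layer_index = layer_index
            self.history = []

        def on_epoch_end(self, epoch, logs=None):
            # assumes [kernel, recurrent_kernel, bias], as for LSTM/GRU
            kernel, recurrent, bias = self.model.layers[self.layer_index].get_weights()
            norms = {'kernel': float(np.linalg.norm(kernel)),
                     'recurrent': float(np.linalg.norm(recurrent)),
                     'bias': float(np.linalg.norm(bias))}
            self.history.append(norms)
            print(f"epoch {epoch}: {norms}")

    # usage: model.fit(x, y, epochs=10, callbacks=[WeightNormLogger(0)])
    ```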


    How does 'learning' work?

    The 'ultimate truth' of machine learning that is seldom discussed or emphasized is, we don't have access to the function we're trying to optimize - the test loss function. All of our work is with what are approximations of the true loss surface - both the train set and the validation set. This has some critical implications:

    1. Train set global optimum can lie very far from test set global optimum
    2. Local optima are largely irrelevant:
      • A train set local optimum is almost always a better test set optimum
      • Actual local optima are almost impossible for high-dimensional problems; for the case of the "saddle", you'd need the gradients w.r.t. all of the millions of parameters to equal zero at once
      • Local attractors are a lot more relevant; the analogy then shifts from "falling into a pit" to "gravitating into a strong field"; once in that field, your loss surface topology is bound to that set up by the field, which defines its own local optima; a high LR can help exit a field, much like "escape velocity"

    Further, loss functions are way too complex to analyze directly; a better approach is to localize analysis to individual layers, their weight matrices, and roles relative to the entire NN. Two key considerations are:

    1. Feature extraction capability. Ex: the driving mechanism of deep classifiers is, given input data, to increase class separability with each layer's transformation. Higher quality features will filter out irrelevant information, and deliver what's essential for the output layer (e.g. softmax) to learn a separating hyperplane.

    2. Information utility. Dead neurons and extreme activations are major culprits of poor information utility; no single neuron should dominate information transfer, and too many neurons shouldn't lie purposeless. Stable activations and weight distributions enable gradient propagation and continued learning. (A quick check is sketched below.)
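
    A rough introspection sketch along these lines; model and x_batch are assumed from your own setup, and both thresholds are arbitrary heuristics, not standards:

    ```python
    # Quick check for dead / saturated units in a layer's activations.
    import numpy as np
    import tensorflow as tf

    feature_model = tf.keras.Model(model.inputs, model.layers[0].output)
    acts = feature_model(x_batch).numpy()    # (batch, units) or (batch, timesteps, units)
    acts = acts.reshape(-1, acts.shape[-1])  # flatten batch/time dimensions

    dead = np.mean(np.all(np.abs(acts) < 1e-5, axis=0))  # units that never activate
    saturated = np.mean(np.abs(acts) > 0.99)             # tanh outputs pinned near +/-1
    print(f"dead units: {dead:.1%}, saturated activations: {saturated:.1%}")
    ```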


    How does regularization work? (read the section above first)

    In a nutshell, via maximizing NN's information utility, and improving estimates of the test loss function. Each regularization method is unique, and no two exactly alike - see "RNN regularizers".


    RNN: Depth vs. Width: not as simple as "one is more nonlinear, other works in higher dimensions".

    • RNN width is defined by (1) # of input channels; (2) # of the cell's filters (output channels). As with CNNs, each RNN filter is an independent feature extractor: more are suited for higher-complexity information, including but not limited to: dimensionality, modality, noise, frequency.
    • RNN depth is defined by (1) # of stacked layers; (2) # of timesteps. Specifics will vary by architecture, but from information standpoint, unlike CNNs, RNNs are dense: every timestep influences the ultimate output of a layer, hence the ultimate output of the next layer - so it again isn't as simple as "more nonlinearity"; stacked RNNs exploit both spatial and temporal information.

    Update:

    Here is an example of a near-ideal RNN gradient propagation for 170+ timesteps:

    This is rare, and was achieved via careful regularization, normalization, and hyperparameter tuning. Usually we see a large gradient for the last few timesteps, which drops off sharply toward the left - as here. Also, since the model is stateful and fits 7 equivalent windows, the gradient effectively spans 1200 timesteps.

    Update 2: see 9 w/ new info & correction

    Update 3: add weight norms & weights introspection code
