Neural Network: Mysterious ReLU

Problem Description

I've been building a programming language detector, i.e., a classifier of code snippets, as part of a bigger project. My baseline model is pretty straightforward: tokenize the input, encode the snippets as bag-of-words (or, in this case, bag-of-tokens), and build a simple NN on top of these features.

The input to the NN is a fixed-length array of counts of the most distinctive tokens, such as "def", "self", "function", "->", "const", "#include", etc., automatically extracted from the corpus. The idea is that these tokens are pretty unique to programming languages, so even this naive approach should yield a high accuracy score.

Input:
  def   1
  for   2
  in    2
  True  1
  ):    3
  ,:    1

  ...

Output: python
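
For illustration, here is roughly how such a bag-of-tokens encoding could be implemented (the tokenizer regex and the vocabulary below are hypothetical stand-ins, not the ones from the actual project):

import re
from collections import Counter

# Hypothetical vocabulary of distinctive tokens; the real one is
# extracted automatically from the corpus.
VOCAB = ['def', 'for', 'in', 'True', '):', 'function', '#include']

def bag_of_tokens(snippet):
    """Encode a code snippet as a fixed-length vector of token counts."""
    counts = Counter(re.findall(r'\w+|[^\w\s]+', snippet))
    return [counts[token] for token in VOCAB]

print(bag_of_tokens('def f(xs):\n    for x in xs: return True'))
# >>> [1, 1, 1, 1, 1, 0, 0]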

Setup

I got 99% accuracy pretty quickly and took that as a sign that it works just as expected. Here's the model (a full runnable script is here):

# Placeholders
x = tf.placeholder(shape=[None, vocab_size], dtype=tf.float32, name='x')
y = tf.placeholder(shape=[None], dtype=tf.int32, name='y')
training = tf.placeholder_with_default(False, shape=[], name='training')

# One hidden layer with dropout
reg = tf.contrib.layers.l2_regularizer(0.01)
hidden1 = tf.layers.dense(x, units=96, kernel_regularizer=reg, 
                          activation=tf.nn.elu, name='hidden1')
dropout1 = tf.layers.dropout(hidden1, rate=0.2, training=training, name='dropout1')

# Output layer (note the accidental ReLU on the logits -- the "bug" in question)
logits = tf.layers.dense(dropout1, units=classes, kernel_regularizer=reg,
                         activation=tf.nn.relu, name='logits')

# Cross-entropy loss
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))

# Misc reports: accuracy, correct/misclassified samples, etc.
correct_predicted = tf.nn.in_top_k(logits, y, 1, name='in-top-k')
prediction = tf.argmax(logits, axis=1)
wrong_predicted = tf.logical_not(correct_predicted, name='not-in-top-k')
x_misclassified = tf.boolean_mask(x, wrong_predicted, name='misclassified')
accuracy = tf.reduce_mean(tf.cast(correct_predicted, tf.float32), name='accuracy')
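
For completeness, a graph like this would typically be driven as follows (a sketch only; the optimizer choice and the batch-feeding details are my assumptions, not taken from the original script):

# Assumed training driver; x_batch/y_batch are numpy arrays of shape
# [batch, vocab_size] and [batch] respectively.
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Training step: dropout is enabled via the `training` placeholder.
    sess.run(train_op, feed_dict={x: x_batch, y: y_batch, training: True})
    # Evaluation: `training` defaults to False, so dropout is disabled.
    val_acc = sess.run(accuracy, feed_dict={x: x_val, y: y_val})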

The output was very encouraging:

iteration=5  loss=2.580  train-acc=0.34277
iteration=10  loss=2.029  train-acc=0.69434
iteration=15  loss=2.054  train-acc=0.92383
iteration=20  loss=1.934  train-acc=0.98926
iteration=25  loss=1.942  train-acc=0.99609
Files.VAL mean accuracy = 0.99121             <-- After just 1 epoch!

iteration=30  loss=1.943  train-acc=0.99414
iteration=35  loss=1.947  train-acc=0.99512
iteration=40  loss=1.946  train-acc=0.99707
iteration=45  loss=1.946  train-acc=0.99609
iteration=50  loss=1.944  train-acc=0.99902
iteration=55  loss=1.946  train-acc=0.99902
Files.VAL mean accuracy = 0.99414

Test accuracy was also around 1.0. Everything looked perfect.

But then I noticed that I had put activation=tf.nn.relu into the final dense layer (logits), which is clearly a bug: there is no need to discard negative scores before softmax, because they simply indicate classes with low probability. A zero threshold only makes those classes artificially more probable, which is a mistake. Getting rid of it should only make the model more robust and confident in the correct class.
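
To make the distortion concrete, here is a toy computation (NumPy, with made-up logits): clipping the negative logits at zero pulls the losing classes up and dilutes the winner's probability.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([3.0, -2.0, -5.0])     # hypothetical raw scores
print(softmax(logits))                   # ~ [0.993, 0.007, 0.000]
print(softmax(np.maximum(logits, 0.0)))  # ~ [0.909, 0.045, 0.045] -- relu inflated the losers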

That's what I thought. So I replaced it with activation=None, ran the model again, and then a surprising thing happened: the performance didn't improve. At all. In fact, it degraded significantly:

iteration=5  loss=5.236  train-acc=0.16602
iteration=10  loss=4.068  train-acc=0.18750
iteration=15  loss=3.110  train-acc=0.37402
iteration=20  loss=5.149  train-acc=0.14844
iteration=25  loss=2.880  train-acc=0.18262
Files.VAL mean accuracy = 0.28711

iteration=30  loss=3.136  train-acc=0.25781
iteration=35  loss=2.916  train-acc=0.22852
iteration=40  loss=2.156  train-acc=0.39062
iteration=45  loss=1.777  train-acc=0.45312
iteration=50  loss=2.726  train-acc=0.33105
Files.VAL mean accuracy = 0.29362

The accuracy improved with training but never surpassed 91-92%. I changed the activation back and forth several times, varying different parameters (layer size, dropout, regularizer, extra layers, anything), and always got the same outcome: the "wrong" model hit 99% immediately, while the "right" model barely achieved 90% after 50 epochs. According to TensorBoard, there was no big difference in the weight distributions: the gradients didn't die out and both models learned normally.

How is this possible? How can the final ReLU make a model so much better? Especially if this ReLU is a bug?

Recommended Answer

Prediction distribution

After playing around with it for a while, I decided to visualize the actual prediction distribution for both models:

predicted_distribution = tf.nn.softmax(logits, name='distribution')
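
(The histograms below were recorded with TensorBoard; a summary op along these lines would produce them, though the exact logging code here is my assumption:)

# Assumed TensorBoard logging for the prediction distribution.
tf.summary.histogram('distribution', predicted_distribution)
summary_op = tf.summary.merge_all()
writer = tf.summary.FileWriter('logs')
# Inside the training loop:
#   summary = sess.run(summary_op, feed_dict=...)
#   writer.add_summary(summary, global_step=iteration)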

Below are the histograms of the distributions and how they evolved over time.

[Histogram: with ReLU (the wrong model)]

[Histogram: without ReLU (the correct model)]

The first histogram makes sense: most of the probabilities are close to 0. But the histogram of the ReLU model is suspicious: the values seem to concentrate around 0.15 after a few iterations. Printing the actual predictions confirmed this idea:

[0.14286 0.14286 0.14286 0.14286 0.14286 0.14286 0.14286]
[0.14286 0.14286 0.14286 0.14286 0.14286 0.14286 0.14286]

I had 7 classes (for 7 different languages at that moment), and 0.14286 is 1/7. It turns out the "perfect" model learned to output all-zero logits, which in turn translated into a uniform prediction.
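
A quick sanity check (plain NumPy) confirms that all-zero logits yield exactly this distribution:

import numpy as np

logits = np.zeros(7)  # what the "perfect" model learned to output
print(np.exp(logits) / np.exp(logits).sum())
# >>> seven identical values of 0.14285714, i.e. 1/7 each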

But how can this distribution be reported as 99% accurate?

Before diving into tf.nn.in_top_k, I checked an alternative way to compute accuracy:

true_correct = tf.equal(tf.argmax(logits, 1), tf.cast(y, tf.int64))
alternative_accuracy = tf.reduce_mean(tf.cast(true_correct, tf.float32))

... which performs an honest comparison of the highest-predicted class against the ground truth. The result is this:

iteration=2  loss=3.992  train-acc=0.13086  train-alt-acc=0.13086
iteration=4  loss=3.590  train-acc=0.13086  train-alt-acc=0.12207
iteration=6  loss=2.871  train-acc=0.21777  train-alt-acc=0.13672
iteration=8  loss=2.466  train-acc=0.37695  train-alt-acc=0.16211
iteration=10  loss=2.099  train-acc=0.62305  train-alt-acc=0.10742
iteration=12  loss=2.066  train-acc=0.79980  train-alt-acc=0.17090
iteration=14  loss=2.016  train-acc=0.84277  train-alt-acc=0.17285
iteration=16  loss=1.954  train-acc=0.91309  train-alt-acc=0.13574
iteration=18  loss=1.956  train-acc=0.95508  train-alt-acc=0.06445
iteration=20  loss=1.923  train-acc=0.97754  train-alt-acc=0.11328

Indeed, tf.nn.in_top_k with k=1 quickly diverged from the right accuracy and began to report fantasized 99% values. So what does it actually do? Here's what the documentation says about it:

Says whether the targets are in the top K predictions.

This outputs a batch_size bool array, an entry out[i] is true if the prediction for the target class is among the top k predictions among all predictions for example i. Note that the behavior of InTopK differs from the TopK op in its handling of ties; if multiple classes have the same prediction value and straddle the top-k boundary, all of those classes are considered to be in the top k.

That's what it is. If the probabilities are uniform (which effectively means "I have no idea"), they are all considered correct. The situation is even worse: if the logits distribution is merely close to uniform, softmax may transform it into an exactly uniform distribution, as can be seen in this simple example:

x = tf.constant([0, 1e-8, 1e-8, 1e-9])  # nearly, but not exactly, uniform logits
tf.nn.softmax(x).eval()
# >>> array([0.25, 0.25, 0.25, 0.25], dtype=float32)
# In float32, exp(1e-8) rounds to 1.0, so the output is exactly uniform.

... which means that every nearly uniform prediction may be considered "correct" according to the tf.nn.in_top_k spec.

tf.nn.in_top_k is a dangerous choice of accuracy measure in TensorFlow, because it may silently swallow wrong predictions and report them as "correct". Instead, you should always use this long but trusted expression:

accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), tf.cast(y, tf.int64)), tf.float32))
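
To see the failure mode end to end, here is a minimal demonstration (hypothetical session code) of in_top_k on perfectly uniform logits:

# With tied (all-zero) logits, every class straddles the top-1 boundary,
# so in_top_k reports any label as "correct".
logits = tf.zeros([3, 7])
labels = tf.constant([0, 3, 6])
with tf.Session() as sess:
    print(sess.run(tf.nn.in_top_k(logits, labels, 1)))
# >>> [ True  True  True]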
