Intermediate layer makes TensorFlow optimizer stop working


Problem description


This graph trains a simple signal identity encoder, and in fact shows that the weights are being evolved by the optimizer:

import tensorflow as tf
import numpy as np
initia = tf.random_normal_initializer(0, 1e-3)

DEPTH_1 = 16
OUT_DEPTH = 1
I = tf.placeholder(tf.float32, shape=[None,1], name='I') # input
W = tf.get_variable('W', shape=[1,DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # weights
b = tf.get_variable('b', shape=[DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # biases
O = tf.nn.relu(tf.matmul(I, W) + b, name='O') # activation / output

#W1 = tf.get_variable('W1', shape=[DEPTH_1,DEPTH_1], initializer=initia, dtype=tf.float32) # weights
#b1 = tf.get_variable('b1', shape=[DEPTH_1], initializer=initia, dtype=tf.float32) # biases
#O1 = tf.nn.relu(tf.matmul(O, W1) + b1, name='O1')

W2 = tf.get_variable('W2', shape=[DEPTH_1,OUT_DEPTH], initializer=initia, dtype=tf.float32) # weights
b2 = tf.get_variable('b2', shape=[OUT_DEPTH], initializer=initia, dtype=tf.float32) # biases
O2 = tf.matmul(O, W2) + b2

O2_0 = tf.gather_nd(O2, [[0,0]])

estimate0 = 2.0*O2_0

eval_inp = tf.gather_nd(I,[[0,0]])
k = 1e-5
L = 5.0
distance = tf.reduce_sum( tf.square( eval_inp - estimate0 ) )

opt = tf.train.GradientDescentOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(distance, [W, b, #W1, b1,
  W2, b2])
clipped_grads_and_vars = [(tf.clip_by_value(g, -4.5, 4.5), v) for g, v in grads_and_vars]

train_op = opt.apply_gradients(clipped_grads_and_vars)

saver = tf.train.Saver()
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
  sess.run(init_op)
  for i in range(10000):
    print(sess.run([train_op, I, W, distance], feed_dict={ I: 2.0*np.random.rand(1,1) - 1.0}))
  for i in range(10):
    print(sess.run([eval_inp, W, estimate0], feed_dict={ I: 2.0*np.random.rand(1,1) - 1.0}))

However, when I uncomment the intermediate hidden layer and train the resulting network, I see that the weights are not evolving anymore:

import tensorflow as tf
import numpy as np
initia = tf.random_normal_initializer(0, 1e-3)

DEPTH_1 = 16
OUT_DEPTH = 1
I = tf.placeholder(tf.float32, shape=[None,1], name='I') # input
W = tf.get_variable('W', shape=[1,DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # weights
b = tf.get_variable('b', shape=[DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # biases
O = tf.nn.relu(tf.matmul(I, W) + b, name='O') # activation / output

W1 = tf.get_variable('W1', shape=[DEPTH_1,DEPTH_1], initializer=initia, dtype=tf.float32) # weights
b1 = tf.get_variable('b1', shape=[DEPTH_1], initializer=initia, dtype=tf.float32) # biases
O1 = tf.nn.relu(tf.matmul(O, W1) + b1, name='O1')

W2 = tf.get_variable('W2', shape=[DEPTH_1,OUT_DEPTH], initializer=initia, dtype=tf.float32) # weights
b2 = tf.get_variable('b2', shape=[OUT_DEPTH], initializer=initia, dtype=tf.float32) # biases
O2 = tf.matmul(O1, W2) + b2

O2_0 = tf.gather_nd(O2, [[0,0]])

estimate0 = 2.0*O2_0

eval_inp = tf.gather_nd(I,[[0,0]])

distance = tf.reduce_sum( tf.square( eval_inp - estimate0 ) )

opt = tf.train.GradientDescentOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(distance, [W, b, W1, b1,
  W2, b2])
clipped_grads_and_vars = [(tf.clip_by_value(g, -4.5, 4.5), v) for g, v in grads_and_vars]

train_op = opt.apply_gradients(clipped_grads_and_vars)

saver = tf.train.Saver()
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
  sess.run(init_op)
  for i in range(10000):
    print(sess.run([train_op, I, W, distance], feed_dict={ I: 2.0*np.random.rand(1,1) - 1.0}))
  for i in range(10):
    print(sess.run([eval_inp, W, estimate0], feed_dict={ I: 2.0*np.random.rand(1,1) - 1.0}))

The evaluation of estimate0 converges quickly to some fixed value that becomes independent of the input signal. I have no idea why this is happening.

Question:

Any idea what might be wrong with the second example?

Answer

TL;DR: the deeper the neural network becomes, the more attention you should pay to the gradient flow (see this discussion of "vanishing gradients"). One particular case is variable initialization.


Problem analysis

I've added TensorBoard summaries for the variables and gradients to both of your scripts and got the following:

2-layer network

3-layer network

The charts show the distributions of the W:0 variable (the first layer) and how they change from epoch 0 to epoch 1000. Indeed, we can see that the rate of change is much higher in the 2-layer network. But I'd like to draw attention to the gradient distribution, which is much closer to 0 in the 3-layer network (the first variance is around 0.005, the second one is around 0.000002, i.e. about 1000 times smaller). This is the vanishing gradient problem.
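
To get an intuition for why the extra layer shrinks the gradients so much, note that backpropagating through it multiplies the incoming gradient by W1 (transposed), whose entries were drawn from N(0, 1e-3). Here is a minimal NumPy sketch of that effect (my illustration, not part of the original scripts), using the same initialization scale:

import numpy as np

rng = np.random.RandomState(0)
DEPTH_1 = 16

# intermediate-layer weights with the same tiny scale as in the question
W1 = rng.normal(0.0, 1e-3, size=(DEPTH_1, DEPTH_1))

# some gradient arriving from the layer above
upstream = rng.normal(0.0, 1.0, size=(DEPTH_1,))

# backprop through the extra matmul multiplies by W1^T (the ReLU mask can only
# zero out more entries), so the gradient reaching the first layer is tiny
downstream = W1.T @ upstream

print(np.var(upstream))    # on the order of 1
print(np.var(downstream))  # on the order of 1e-5, several orders of magnitude smaller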

Here's the helper code if you're interested:

for g, v in grads_and_vars:
  tf.summary.histogram(v.name, v)           # distribution of the variable itself
  tf.summary.histogram(v.name + '_grad', g) # distribution of its gradient

merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('train_log_layer2', tf.get_default_graph())

...

# inside the training loop:
_, summary = sess.run([train_op, merged], feed_dict={I: 2*np.random.rand(1, 1)-1})
if i % 10 == 0:
  writer.add_summary(summary, global_step=i)

Solution

All deep networks suffer from this to some extent and there is no universal solution that will auto-magically fix any network. But there are some techniques that can push it in the right direction. Initialization is one of them.

I replaced your normal initialization with:

W_init = tf.contrib.layers.xavier_initializer()
b_init = tf.constant_initializer(0.1)

There are lots of tutorials on Xavier init; you can take a look at this one, for example. Note that I set the bias init to be slightly positive to make sure that the ReLU outputs are positive for most of the neurons, at least in the beginning.
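
Plugged into the existing graph, this just means passing the new initializers to the same tf.get_variable calls, roughly like this (a sketch of the change, not the full script):

W_init = tf.contrib.layers.xavier_initializer()
b_init = tf.constant_initializer(0.1)

W  = tf.get_variable('W',  shape=[1, DEPTH_1],         initializer=W_init, dtype=tf.float32) # weights
b  = tf.get_variable('b',  shape=[DEPTH_1],            initializer=b_init, dtype=tf.float32) # biases
W1 = tf.get_variable('W1', shape=[DEPTH_1, DEPTH_1],   initializer=W_init, dtype=tf.float32)
b1 = tf.get_variable('b1', shape=[DEPTH_1],            initializer=b_init, dtype=tf.float32)
W2 = tf.get_variable('W2', shape=[DEPTH_1, OUT_DEPTH], initializer=W_init, dtype=tf.float32)
b2 = tf.get_variable('b2', shape=[OUT_DEPTH],          initializer=b_init, dtype=tf.float32)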

This changed the picture immediately.

The weights are still not moving quite as fast as before, but they are moving (note the scale of the W:0 values), and the gradient distribution became much less peaked at 0, which is much better.

Of course, this is not the end. To improve it further, you should implement a full autoencoder, because currently the loss is affected only by the reconstruction of the [0,0] element, so most outputs aren't used in the optimization at all. You can also play with different optimizers (Adam would be my choice) and the learning rate.
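
As a rough sketch of that suggestion (my reading of it, not the answer's original code): compute the reconstruction loss over the whole batch instead of just the [0,0] element, and let Adam drive the updates:

# reconstruction loss over the entire output, not just O2[0, 0]
loss = tf.reduce_mean(tf.square(I - O2))

opt = tf.train.AdamOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(loss, [W, b, W1, b1, W2, b2])
clipped_grads_and_vars = [(tf.clip_by_value(g, -4.5, 4.5), v) for g, v in grads_and_vars]
train_op = opt.apply_gradients(clipped_grads_and_vars)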
