Intermediate layer makes TensorFlow optimizer stop working


Problem description

This graph trains a simple signal identity encoder, and in fact shows that the weights are being evolved by the optimizer:

import tensorflow as tf
import numpy as np
initia = tf.random_normal_initializer(0, 1e-3)

DEPTH_1 = 16
OUT_DEPTH = 1
I = tf.placeholder(tf.float32, shape=[None,1], name='I') # input
W = tf.get_variable('W', shape=[1,DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # weights
b = tf.get_variable('b', shape=[DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # biases
O = tf.nn.relu(tf.matmul(I, W) + b, name='O') # activation / output

#W1 = tf.get_variable('W1', shape=[DEPTH_1,DEPTH_1], initializer=initia, dtype=tf.float32) # weights
#b1 = tf.get_variable('b1', shape=[DEPTH_1], initializer=initia, dtype=tf.float32) # biases
#O1 = tf.nn.relu(tf.matmul(O, W1) + b1, name='O1')

W2 = tf.get_variable('W2', shape=[DEPTH_1,OUT_DEPTH], initializer=initia, dtype=tf.float32) # weights
b2 = tf.get_variable('b2', shape=[OUT_DEPTH], initializer=initia, dtype=tf.float32) # biases
O2 = tf.matmul(O, W2) + b2

O2_0 = tf.gather_nd(O2, [[0,0]])

estimate0 = 2.0*O2_0

eval_inp = tf.gather_nd(I,[[0,0]])
k = 1e-5
L = 5.0
distance = tf.reduce_sum( tf.square( eval_inp - estimate0 ) )

opt = tf.train.GradientDescentOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(distance, [W, b, #W1, b1,
  W2, b2])
clipped_grads_and_vars = [(tf.clip_by_value(g, -4.5, 4.5), v) for g, v in grads_and_vars]

train_op = opt.apply_gradients(clipped_grads_and_vars)

saver = tf.train.Saver()
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
  sess.run(init_op)
  for i in range(10000):
    print(sess.run([train_op, I, W, distance], feed_dict={ I: 2.0*np.random.rand(1,1) - 1.0}))
  for i in range(10):
    print(sess.run([eval_inp, W, estimate0], feed_dict={ I: 2.0*np.random.rand(1,1) - 1.0}))

However, when I uncomment the intermediate hidden layer and train the resulting network, I see that the weights are not evolving anymore:

import tensorflow as tf
import numpy as np
initia = tf.random_normal_initializer(0, 1e-3)

DEPTH_1 = 16
OUT_DEPTH = 1
I = tf.placeholder(tf.float32, shape=[None,1], name='I') # input
W = tf.get_variable('W', shape=[1,DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # weights
b = tf.get_variable('b', shape=[DEPTH_1], initializer=initia, dtype=tf.float32, trainable=True) # biases
O = tf.nn.relu(tf.matmul(I, W) + b, name='O') # activation / output

W1 = tf.get_variable('W1', shape=[DEPTH_1,DEPTH_1], initializer=initia, dtype=tf.float32) # weights
b1 = tf.get_variable('b1', shape=[DEPTH_1], initializer=initia, dtype=tf.float32) # biases
O1 = tf.nn.relu(tf.matmul(O, W1) + b1, name='O1')

W2 = tf.get_variable('W2', shape=[DEPTH_1,OUT_DEPTH], initializer=initia, dtype=tf.float32) # weights
b2 = tf.get_variable('b2', shape=[OUT_DEPTH], initializer=initia, dtype=tf.float32) # biases
O2 = tf.matmul(O1, W2) + b2

O2_0 = tf.gather_nd(O2, [[0,0]])

estimate0 = 2.0*O2_0

eval_inp = tf.gather_nd(I,[[0,0]])

distance = tf.reduce_sum( tf.square( eval_inp - estimate0 ) )

opt = tf.train.GradientDescentOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(distance, [W, b, W1, b1,
  W2, b2])
clipped_grads_and_vars = [(tf.clip_by_value(g, -4.5, 4.5), v) for g, v in grads_and_vars]

train_op = opt.apply_gradients(clipped_grads_and_vars)

saver = tf.train.Saver()
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
  sess.run(init_op)
  for i in range(10000):
    print(sess.run([train_op, I, W, distance], feed_dict={ I: 2.0*np.random.rand(1,1) - 1.0}))
  for i in range(10):
    print(sess.run([eval_inp, W, estimate0], feed_dict={ I: 2.0*np.random.rand(1,1) - 1.0}))

The evaluation of estimate0 quickly converges to some fixed value that becomes independent of the input signal. I have no idea why this is happening.

Question:

Any idea what might be wrong with the second example?

Answer

TL;DR: the deeper the neural network becomes, the more you should pay attention to the gradient flow (see this discussion of "vanishing gradients"). One particular case is variable initialization.

I've added tensorboard summaries for the variables and gradients into both of your scripts and got the following:

[TensorBoard histograms: 2-layer network]

[TensorBoard histograms: 3-layer network]

The charts show the distributions of the W:0 variable (the first layer) and how they change from epoch 0 to 1000. Indeed, we can see that the rate of change is much higher in the 2-layer network. But I'd like to draw attention to the gradient distribution, which is much closer to 0 in the 3-layer network (the first variance is around 0.005, the second is around 0.000002, i.e. 1000 times smaller). This is the vanishing gradient problem.
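To see the scale effect directly: with weights drawn from N(0, 1e-3), as in the question's initia, backprop multiplies the upstream gradient by one weight matrix per layer, so every extra layer shrinks the signal reaching the first layer by roughly another factor of 1e-3. A minimal numpy sketch (not the original graph; it ignores the ReLU masks and keeps only the matrix products) illustrates this:

import numpy as np

np.random.seed(0)
DEPTH_1 = 16

# weights drawn as in the question: normal with stddev 1e-3
W1 = np.random.normal(0, 1e-3, (DEPTH_1, DEPTH_1))
W2 = np.random.normal(0, 1e-3, (DEPTH_1, 1))

g = np.ones((1, 1))            # gradient arriving at the output
g_O1 = g @ W2.T                # gradient reaching the hidden layer O1
g_O = g_O1 @ W1.T              # gradient reaching the first layer O

print(np.abs(g_O1).mean())     # on the order of 1e-3
print(np.abs(g_O).mean())      # roughly another 1000x smaller

The extra matmul through W1 accounts for the ~1000x gap between the two gradient variances observed above.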

Here's the helper code if you're interested:

for g, v in grads_and_vars:
  tf.summary.histogram(v.name, v)            # distribution of the variable itself
  tf.summary.histogram(v.name + '_grad', g)  # distribution of its gradient

merged = tf.summary.merge_all()
writer = tf.summary.FileWriter('train_log_layer2', tf.get_default_graph())

...

# inside the training loop: fetch the merged summaries together with the train op
_, summary = sess.run([train_op, merged], feed_dict={I: 2*np.random.rand(1, 1)-1})
if i % 10 == 0:
  writer.add_summary(summary, global_step=i)
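After training, the logged histograms can be inspected by pointing TensorBoard at the log directory, e.g. tensorboard --logdir train_log_layer2.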

Solution

All deep networks suffer from this to some extent and there is no universal solution that will auto-magically fix any network. But there are some techniques that can push it in the right direction. Initialization is one of them.

I replaced your normal initialization with:

W_init = tf.contrib.layers.xavier_initializer()  # variance scaled to the layer's fan-in/fan-out
b_init = tf.constant_initializer(0.1)            # slightly positive, so ReLUs start in the active regime
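Wired into the variables from the question, only the initializer argument changes; a sketch of the first layer (the remaining layers follow the same pattern):

W = tf.get_variable('W', shape=[1,DEPTH_1], initializer=W_init, dtype=tf.float32, trainable=True) # weights
b = tf.get_variable('b', shape=[DEPTH_1], initializer=b_init, dtype=tf.float32, trainable=True) # biases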

There are lots of tutorials on Xavier init; you can take a look at this one, for example. Note that I set the bias init to be slightly positive to make sure that the ReLU outputs are positive for most of the neurons, at least in the beginning.

This immediately changed the picture:

[TensorBoard histograms after re-initialization]

The weights are still not moving quite as fast as before, but they are moving (note the scale of the W:0 values), and the gradient distribution is much less peaked at 0, which is much better.

Of course, it's not the end. To improve it further, you should implement the full autoencoder, because currently the loss is affected only by the reconstruction of the [0,0] element, so most outputs aren't used in optimization. You can also play with different optimizers (Adam would be my choice) and the learning rates.
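As a sketch of both suggestions (assumptions: the full-reconstruction loss below replaces the [0,0]-only distance, Adam replaces plain gradient descent, and the gradient clipping step is omitted for brevity):

# reconstruct every element of the input batch, not just I[0,0]
distance = tf.reduce_mean(tf.square(I - O2))

# Adam adapts per-parameter step sizes, which helps when raw gradients are small
opt = tf.train.AdamOptimizer(1e-3)
train_op = opt.minimize(distance)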
