How to handle gradients when training two sub-graphs simultaneously


Question

The general idea I am trying to realize is a seq2seq model (taken from the translate.py example in the models repository, based on the seq2seq class). This trains well.

Furthermore, I am using the hidden state of the RNN after all the encoding is done, right before decoding starts (I call it the "hidden state at end of encoding"). I feed this hidden state at end of encoding into a further sub-graph which I call "prices" (see below). The training gradients of this sub-graph backpropagate not only through this additional sub-graph, but also back into the encoder part of the RNN (which is what I want and need).

The plan is to add more such sub-graphs to the hidden state at end of encoding, as I want to analyze the input phrases in a variety of ways.

Now, during training, when I evaluate and train both sub-graphs (encoder+prices AND encoder+decoder) at the same time, the net does NOT converge. However, if I execute the training in the following way (pseudo-code):

if global_step % 10 == 0:
    execute-the-price-training_code
else:
    execute-the-decoder-training_code

so that I am not training both sub-graphs simultaneously, then it does converge, but the encoder+decoder part converges MUCH more slowly than if I ONLY train this part and never train the prices sub-graph.
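Concretely, this alternating schedule looks roughly like the following in the training loop (a minimal sketch: the session, feed dicts, and loop variables are placeholders, while training_op_price and updates are the ops built in the code below):

# Sketch of the alternating schedule; only one sub-graph's update op runs per step.
for step in range(num_training_steps):
    if step % 10 == 0:
        # every 10th step: train encoder + prices only
        session.run(model.training_op_price, feed_dict=price_feed)
    else:
        # all other steps: train encoder + decoder (seq2seq) only
        session.run(model.updates[bucket_id], feed_dict=seq2seq_feed)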

My question is: I should be able to train both sub-graphs simultaneously. But probably I have to rescale the gradients flowing back into the hidden state at end of encoding, since here we get gradients from both the prices sub-graph AND the decoder sub-graph. How should this rescaling be done? I didn't find any papers describing such an undertaking, but maybe I am searching with the wrong keywords.

Here is the training part of the code.

This is the (almost original) training-op preparation:

if not forward_only:
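  # Build one clipped-gradient training op per bucket for the seq2seq (encoder+decoder) loss.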
  self.gradient_norms = []
  self.updates = []
  opt = tf.train.AdadeltaOptimizer(self.learning_rate)

  for bucket_id in xrange(len(buckets)):
    tf.scalar_summary("seq2seq loss", self.losses[bucket_id])

    gradients = tf.gradients(self.losses[bucket_id], var_list_seq2seq)
    clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
    self.gradient_norms.append(norm)
    self.updates.append(opt.apply_gradients(zip(clipped_gradients, var_list_seq2seq), global_step=self.global_step))

Now, additionally, I am running a second sub-graph that takes the hidden state at end of encoding as input:

  with tf.name_scope('prices') as scope:
    #First layer
    W_price_first_layer = tf.Variable(tf.random_normal([num_layers*size, self.prices_hidden_layer_size], stddev=0.35), name="W_price_first_layer")
    B_price_first_layer = tf.Variable(tf.zeros([self.prices_hidden_layer_size]), name="B_price_first_layer")
    self.output_price_first_layer = tf.add(tf.matmul(self.hidden_state, W_price_first_layer), B_price_first_layer)
    self.activation_price_first_layer = tf.nn.sigmoid(self.output_price_first_layer)
    #self.activation_price_first_layer = tf.nn.Relu(self.output_price_first_layer)

    #Second layer to softmax (price ranges)
    W_price = tf.Variable(tf.random_normal([self.prices_hidden_layer_size, self.prices_bit_size], stddev=0.35), name="W_price")
    W_price_t = tf.transpose(W_price)
    B_price = tf.Variable(tf.zeros([self.prices_bit_size]), name="B_price")

    self.output_price_second_layer = tf.add(tf.matmul(self.activation_price_first_layer, W_price),B_price)
    self.price_prediction = tf.nn.softmax(self.output_price_second_layer)
    self.label_price      = tf.placeholder(tf.int32, shape=[self.batch_size], name="price_label")

    #Remember the prices trainables
    var_list_prices       = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "prices")
    var_list_all          = tf.trainable_variables()

    #Backprop
    self.loss_price        = tf.nn.sparse_softmax_cross_entropy_with_logits(self.output_price_second_layer, self.label_price)
    self.loss_price_scalar = tf.reduce_mean(self.loss_price)
    self.optimizer_price   = tf.train.AdadeltaOptimizer(self.learning_rate_prices)
    self.training_op_price = self.optimizer_price.minimize(self.loss_price, var_list=var_list_all)

Thanks a bunch.

Answer

I expect that running two optimizers simultaneously will lead to inconsistent gradient updates on the common variables, and this might be causing your training not to converge.

Instead, if you add the scalar loss from each sub-network to the "losses collection" (e.g. via tf.contrib.losses.add_loss() or tf.add_to_collection(tf.GraphKeys.LOSSES, ...)), you can use tf.contrib.losses.get_total_loss() to get a single loss value that can be passed to a single standard TensorFlow tf.train.Optimizer subclass. TensorFlow will derive the appropriate back-prop computation for your split network.
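A minimal sketch of that approach, reusing the tensors and variable lists defined in the question (this assumes the tf.contrib.losses API available in TensorFlow releases of that era):

# Register both scalar losses in the losses collection, then optimize their sum.
tf.contrib.losses.add_loss(self.losses[bucket_id])   # seq2seq (encoder+decoder) loss
tf.contrib.losses.add_loss(self.loss_price_scalar)   # prices loss

total_loss = tf.contrib.losses.get_total_loss()
train_op = tf.train.AdadeltaOptimizer(self.learning_rate).minimize(
    total_loss, var_list=var_list_all, global_step=self.global_step)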

The get_total_loss() method simply computes an unweighted sum of the values that have been added to the losses collection. I'm not familiar with the literature on how (or whether) you should scale these values, but you can use any arbitrary (differentiable) TensorFlow expression to combine the losses and pass the result to a single optimizer.
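For example, a weighted sum fed to a single optimizer might look like this (a sketch only; alpha is an assumed weighting hyperparameter, not something the answer prescribes):

# Combine the two losses with an arbitrary (differentiable) weighting and train
# all variables with one optimizer, so the shared encoder receives a single
# consistent gradient update per step.
alpha = 0.1  # assumed relative weight of the prices loss; tune as needed
total_loss = self.losses[bucket_id] + alpha * self.loss_price_scalar

opt = tf.train.AdadeltaOptimizer(self.learning_rate)
gradients = tf.gradients(total_loss, var_list_all)
clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
self.combined_update = opt.apply_gradients(
    zip(clipped_gradients, var_list_all), global_step=self.global_step)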
