How to handle gradients when training two sub-graphs simultaneously


Question

The general idea I am trying to realize is a seq2seq model (taken from the translate.py example in the models repository, based on the seq2seq class). This trains well.

Furthermore, I am using the hidden state of the RNN after all the encoding is done, right before decoding starts (I call it the "hidden state at end of encoding"). I feed this hidden state at end of encoding into a further sub-graph which I call "prices" (see below). The training gradients of this sub-graph backpropagate not only through this additional sub-graph, but also back into the encoder part of the RNN (which is what I want and need).

The plan is to add more such sub-graphs to the hidden state at end of encoding, as I want to analyze the input phrases in a variety of ways.

Now, during training, when I evaluate and train both sub-graphs (encoder+prices AND encoder+decoder) at the same time, the net does NOT converge. However, if I execute the training in the following way (pseudo-code):

if global_step % 10 == 0:
    execute-the-price-training_code
else:
    execute-the-decoder-training_code

so that I am not training both sub-graphs simultaneously, then it does converge, but the encoder+decoder part converges MUCH more slowly than if I ONLY train this part and never train the prices sub-graph.
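Concretely, this alternating schedule looks roughly like the following in the training loop (a minimal sketch: the session, feed dicts, and loop variables are placeholders, while training_op_price and updates are the ops built in the code below):

# Sketch of the alternating schedule; only one sub-graph's update op runs per step.
for step in range(num_training_steps):
    if step % 10 == 0:
        # every 10th step: train encoder + prices only
        session.run(model.training_op_price, feed_dict=price_feed)
    else:
        # all other steps: train encoder + decoder (seq2seq) only
        session.run(model.updates[bucket_id], feed_dict=seq2seq_feed)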

My question is: I should be able to train both sub-graphs simultaneously. But probably I have to rescale the gradients flowing back into the hidden state at end of encoding, since here we get gradients from both the prices sub-graph AND the decoder sub-graph. How should this rescaling be done? I didn't find any papers describing such an undertaking, but maybe I am searching with the wrong keywords.

Here is the training part of the code.

This is the (almost original) training-op preparation:

if not forward_only:
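  # Build one clipped-gradient training op per bucket for the seq2seq (encoder+decoder) loss.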
  self.gradient_norms = []
  self.updates = []
  opt = tf.train.AdadeltaOptimizer(self.learning_rate)

  for bucket_id in xrange(len(buckets)):
    tf.scalar_summary("seq2seq loss", self.losses[bucket_id])

    gradients = tf.gradients(self.losses[bucket_id], var_list_seq2seq)
    clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
    self.gradient_norms.append(norm)
    self.updates.append(opt.apply_gradients(zip(clipped_gradients, var_list_seq2seq), global_step=self.global_step))

Now, additionally, I am running a second sub-graph that takes the hidden state at end of encoding as input:

  with tf.name_scope('prices') as scope:
    #First layer
    W_price_first_layer = tf.Variable(tf.random_normal([num_layers*size, self.prices_hidden_layer_size], stddev=0.35), name="W_price_first_layer")
    B_price_first_layer = tf.Variable(tf.zeros([self.prices_hidden_layer_size]), name="B_price_first_layer")
    self.output_price_first_layer = tf.add(tf.matmul(self.hidden_state, W_price_first_layer), B_price_first_layer)
    self.activation_price_first_layer = tf.nn.sigmoid(self.output_price_first_layer)
    #self.activation_price_first_layer = tf.nn.Relu(self.output_price_first_layer)

    #Second layer to softmax (price ranges)
    W_price = tf.Variable(tf.random_normal([self.prices_hidden_layer_size, self.prices_bit_size], stddev=0.35), name="W_price")
    W_price_t = tf.transpose(W_price)
    B_price = tf.Variable(tf.zeros([self.prices_bit_size]), name="B_price")

    self.output_price_second_layer = tf.add(tf.matmul(self.activation_price_first_layer, W_price),B_price)
    self.price_prediction = tf.nn.softmax(self.output_price_second_layer)
    self.label_price      = tf.placeholder(tf.int32, shape=[self.batch_size], name="price_label")

    #Remember the prices trainables
    var_list_prices       = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "prices")
    var_list_all          = tf.trainable_variables()

    #Backprop
    self.loss_price        = tf.nn.sparse_softmax_cross_entropy_with_logits(self.output_price_second_layer, self.label_price)
    self.loss_price_scalar = tf.reduce_mean(self.loss_price)
    self.optimizer_price   = tf.train.AdadeltaOptimizer(self.learning_rate_prices)
    self.training_op_price = self.optimizer_price.minimize(self.loss_price, var_list=var_list_all)

Thanks a bunch.

Answer

I expect that running two optimizers simultaneously will lead to inconsistent gradient updates on the common variables, and this might be causing your training not to converge.

Instead, if you add the scalar loss from each sub-network to the "losses collection" (e.g. via tf.contrib.losses.add_loss() or tf.add_to_collection(tf.GraphKeys.LOSSES, ...)), you can use tf.contrib.losses.get_total_loss() to get a single loss value that can be passed to a single standard TensorFlow tf.train.Optimizer subclass. TensorFlow will derive the appropriate back-prop computation for your split network.
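A minimal sketch of that approach, reusing the tensors and variable lists defined in the question (this assumes the tf.contrib.losses API available in TensorFlow releases of that era):

# Register both scalar losses in the losses collection, then optimize their sum.
tf.contrib.losses.add_loss(self.losses[bucket_id])   # seq2seq (encoder+decoder) loss
tf.contrib.losses.add_loss(self.loss_price_scalar)   # prices loss

total_loss = tf.contrib.losses.get_total_loss()
train_op = tf.train.AdadeltaOptimizer(self.learning_rate).minimize(
    total_loss, var_list=var_list_all, global_step=self.global_step)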

The get_total_loss() method simply computes an unweighted sum of the values that have been added to the losses collection. I'm not familiar with the literature on how (or whether) you should scale these values, but you can use any arbitrary (differentiable) TensorFlow expression to combine the losses and pass the result to a single optimizer.
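For example, a weighted sum fed to a single optimizer might look like this (a sketch only; alpha is an assumed weighting hyperparameter, not something the answer prescribes):

# Combine the two losses with an arbitrary (differentiable) weighting and train
# all variables with one optimizer, so the shared encoder receives a single
# consistent gradient update per step.
alpha = 0.1  # assumed relative weight of the prices loss; tune as needed
total_loss = self.losses[bucket_id] + alpha * self.loss_price_scalar

opt = tf.train.AdadeltaOptimizer(self.learning_rate)
gradients = tf.gradients(total_loss, var_list_all)
clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
self.combined_update = opt.apply_gradients(
    zip(clipped_gradients, var_list_all), global_step=self.global_step)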
