How to speed up RNN training in TensorFlow?

Question

Based on tensorflow-char-rnn, I have started a word-rnn project to predict the next word. But I found that training on my data set is too slow. Here are my training details:

  • Training data size: 1 billion words
  • Vocabulary size: 0.75 million (750,000)
  • RNN model: lstm
  • RNN layers: 2
  • Cell size: 200
  • Seq length: 20
  • Batch size: 40 (a larger batch size causes an OOM exception)

The machine details:

  • Amazon p2 instance
  • 1 K80 GPU
  • 16 GB video memory
  • 4-core CPU
  • 60 GB RAM

In my test, training for one epoch takes 17 days! That is really too slow. I then changed seq2seq.rnn_decoder to tf.nn.dynamic_rnn, but the time was still 17 days.
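(For scale: at batch_size 40 × seq_length 20, each step consumes 800 words, so 1 billion words is about 1.25 million steps per epoch; 17 days therefore works out to roughly 1.2 seconds per step.)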

I want to find out whether the slowness is caused by my code or whether it has always been this slow, because I have heard rumors that the TensorFlow RNN implementation is slower than other DL frameworks.

This is my model code:

import tensorflow as tf

# Aliases for the pre-1.0 TensorFlow modules this code uses
rnn_cell = tf.nn.rnn_cell
seq2seq = tf.nn.seq2seq


class SeqModel():
    def __init__(self, config, infer=False):
        self.args = config
        if infer:
            # At inference time we feed one token at a time
            config.batch_size = 1
            config.seq_length = 1

        if config.model == 'rnn':
            cell_fn = rnn_cell.BasicRNNCell
        elif config.model == 'gru':
            cell_fn = rnn_cell.GRUCell
        elif config.model == 'lstm':
            cell_fn = rnn_cell.BasicLSTMCell
        else:
            raise Exception("model type not supported: {}".format(config.model))

        cell = cell_fn(config.hidden_size)

        self.cell = cell = rnn_cell.MultiRNNCell([cell] * config.num_layers)

        self.input_data = tf.placeholder(tf.int32, [config.batch_size, config.seq_length])
        self.targets = tf.placeholder(tf.int32, [config.batch_size, config.seq_length])
        self.initial_state = cell.zero_state(config.batch_size, tf.float32)

        with tf.variable_scope('rnnlm'):
            # Output projection back to the vocabulary
            softmax_w = tf.get_variable("softmax_w", [config.hidden_size, config.vocab_size])
            softmax_b = tf.get_variable("softmax_b", [config.vocab_size])

            # Input word embeddings
            embedding = tf.get_variable("embedding", [config.vocab_size, config.hidden_size])
            inputs = tf.nn.embedding_lookup(embedding, self.input_data)

        # outputs: [batch_size, seq_length, hidden_size]
        outputs, last_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=self.initial_state)

        # dynamic_rnn already returns a single tensor (unlike rnn_decoder, which
        # returns a list), so no concat is needed before flattening to
        # [batch_size * seq_length, hidden_size]
        output = tf.reshape(outputs, [-1, config.hidden_size])

        self.logits = tf.matmul(output, softmax_w) + softmax_b
        self.probs = tf.nn.softmax(self.logits)

        self.final_state = last_state

        # Full-vocabulary cross-entropy, averaged per timestep
        loss = seq2seq.sequence_loss_by_example([self.logits],
                                                [tf.reshape(self.targets, [-1])],
                                                [tf.ones([config.batch_size * config.seq_length])],
                                                config.vocab_size)
        self.cost = tf.reduce_sum(loss) / config.batch_size / config.seq_length

        self.lr = tf.Variable(0.0, trainable=False)
        tvars = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, tvars),
                                          config.grad_clip)
        optimizer = tf.train.AdamOptimizer(self.lr)
        self.train_op = optimizer.apply_gradients(zip(grads, tvars))
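For reference, here is a hypothetical minimal training loop for this model; the batches iterator and the learning-rate value are assumptions. Note that lr is initialized to 0.0, so it must be assigned before training or Adam makes no progress:

model = SeqModel(config)
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())   # pre-1.0 API; tf.global_variables_initializer() in 1.0+
    sess.run(tf.assign(model.lr, 0.002))      # lr starts at 0.0 and must be set explicitly
    state = sess.run(model.initial_state)
    for x, y in batches:                      # assumed iterator of (batch_size, seq_length) int32 arrays
        feed = {model.input_data: x,
                model.targets: y,
                model.initial_state: state}
        # Carry final_state into the next step (truncated BPTT across batches)
        cost, state, _ = sess.run([model.cost, model.final_state, model.train_op], feed)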

Here is the GPU load during training (screenshot omitted):

Thanks very much.

Solution

As you mentioned, batch_size is really important to tune; it can lead to an impressive speedup, but make sure your perplexity stays reasonable.

Monitoring your GPU activity can give you hints about a potential I/O bottleneck.
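One way to do that is TensorFlow's built-in step tracing; here is a minimal sketch, assuming sess, model, and feed come from the training loop above:

from tensorflow.python.client import timeline

# Trace a single training step and dump a Chrome trace
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(model.train_op, feed_dict=feed,
         options=run_options, run_metadata=run_metadata)

tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())
# Load timeline.json in chrome://tracing: long gaps between GPU kernels
# usually mean the input pipeline (I/O) is the bottleneck, not compute.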

Most importantly, using sampled softmax instead of the regular softmax is way faster. This would require you to use a [config.vocab_size, config.hidden_size] weight matrix instead of your [config.hidden_size, config.vocab_size] one. This is definitely the way to go from my point of view.
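A minimal sketch of that change (the argument order below follows the pre-1.0 tf.nn.sampled_softmax_loss signature; in TF 1.0+ labels comes before inputs, and num_sampled=1024 is an arbitrary choice):

# Transposed projection: sampled_softmax_loss expects weights of
# shape [num_classes, dim], i.e. [vocab_size, hidden_size]
softmax_w_t = tf.get_variable("softmax_w_t", [config.vocab_size, config.hidden_size])
softmax_b = tf.get_variable("softmax_b", [config.vocab_size])

# Labels must be int64 with shape [batch * seq, 1]
labels = tf.reshape(tf.cast(self.targets, tf.int64), [-1, 1])
loss = tf.nn.sampled_softmax_loss(softmax_w_t, softmax_b,
                                  output,               # [batch * seq, hidden_size]
                                  labels,
                                  num_sampled=1024,     # number of sampled negatives (tunable)
                                  num_classes=config.vocab_size)
self.cost = tf.reduce_sum(loss) / config.batch_size / config.seq_length

# Sampled softmax only approximates the loss during training; for
# inference/perplexity, still compute the full softmax:
full_logits = tf.matmul(output, tf.transpose(softmax_w_t)) + softmax_b
self.probs = tf.nn.softmax(full_logits)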

Hope this helps.

pltrdy
