TensorFlow timeline shows gradient averaging is the performance bottleneck when using multiple GPUs

Problem description

I use multiple (actually 2) GPUs to train a network. The network works well, but I found that the training speed fluctuates.

This is the snippet I used for profiling:

step = 0  # running step counter (not shown in the original snippet, but used in the print below)
for i in range(resume_epoch, c.num_epochs):
    print("Epoch %d" % i)
    sess.run(train_itr.initializer)
    num_batches = num_egs // c.batch_size
    for batch in range(num_batches):
        start_time = time.time()
        _, loss_value = sess.run([train_op, loss])
        duration = time.time() - start_time
        examples_per_sec = c.batch_size / float(duration)
        step += 1
        print('step %d, loss = %.2f (%.1f examples/sec; %.3f '
              'sec/batch)' % (step, loss_value, examples_per_sec, duration))

Here is the output:

...
step 5100, loss = 4.71 (556.3 examples/sec; 0.230 sec/batch)
step 5200, loss = 4.14 (341.9 examples/sec; 0.374 sec/batch)
step 5300, loss = 4.63 (363.4 examples/sec; 0.352 sec/batch)
step 5400, loss = 4.82 (176.0 examples/sec; 0.727 sec/batch)

The fastest steps process almost 600 examples/sec, while, as shown above, some steps can be as slow as ~200 examples/sec.

At the very beginning, I suspected the input pipeline might be the bottleneck. I use tf.data to process the input features, then split and feed them to the different GPU towers. Here is the code:

def create_variable_train_dataset(filenames, batch_size, feat_dim, shuffle_size=-1):
    dataset = tf.data.Dataset.from_tensor_slices(filenames).shuffle(50)
    dataset = dataset.interleave(
        lambda filename: tf.data.TFRecordDataset(filename)
            .map(_parse_tfrecord, num_parallel_calls=8)
            .shuffle(shuffle_size)
            .apply(tf.contrib.data.padded_batch_and_drop_remainder(
                batch_size,
                padded_shapes={'input': [None, feat_dim], 'input_shape': [2], 'output': []})),
        cycle_length=len(filenames), block_length=1)

    dataset = dataset.prefetch(5)
    itr = dataset.make_initializable_iterator()
    element = itr.get_next()
    return itr, element['input'], element['output']

In the main function:

train_itr, train_feature, train_label = create_variable_train_dataset(train_filenames,
                                                                      batch_size=c.batch_size,
                                                                      feat_dim=feat_dim,
                                                                      shuffle_size=400000 // len(train_filenames))
features_splits = tf.split(train_feature, num_or_size_splits=c.num_gpus, axis=0)
labels_splits = tf.split(train_label, num_or_size_splits=c.num_gpus, axis=0)  # mirrors the feature split (not shown in the original snippet)

tower_grads = []
reuse_variables = None
for i in range(c.num_gpus):
    with tf.device(assign_to_device('/gpu:{}'.format(i), ps_device=c.local_ps_device)):
        with tf.name_scope('tower_%d' % i) as scope:
            loss = _tower_loss(features_splits[i], labels_splits[i], num_classes, scope, reuse_variables)
            reuse_variables = True
            grads = ...some_function_to_compute_grad  # placeholder for the per-tower gradient computation
            tower_grads.append(grads)
grads = _average_gradients(tower_grads)

_tower_loss is a function that builds the tower loss on the different GPUs, while the parameters are kept on the CPU.

def _tower_loss(features, labels, num_classes, scope, reuse_variables=None):
    # Build inference Graph.
    with tf.variable_scope(tf.get_variable_scope(), reuse=reuse_variables):
        logits = inference(features, num_classes, is_training=True, scope=scope)

    tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits, scope="loss")

    losses = tf.get_collection(tf.GraphKeys.LOSSES, scope)
    regularization_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    total_loss = tf.add_n(losses + regularization_losses, name='total_loss')

    # Compute the moving average of all individual losses and the total loss.
    loss_averages = tf.train.ExponentialMovingAverage(0.9, name='avg')
    loss_averages_op = loss_averages.apply(losses + [total_loss])

    with tf.control_dependencies([loss_averages_op]):
        total_loss = tf.identity(total_loss)

    return total_loss

Next, I used the Timeline tool to inspect the elapsed time during training. To my surprise, the CPU takes a really long time. Here is what I did:

start_time = time.time()
if step % 100 == 0:
    _, loss_value = sess.run([train_op, loss], options=run_options, run_metadata=run_metadata)
    duration = time.time() - start_time
    # Create the Timeline object, and write it to a json
    tl = timeline.Timeline(run_metadata.step_stats)
    ctf = tl.generate_chrome_trace_format()
    with open('timeline.json', 'w') as f:
        f.write(ctf)
else:
    _, loss_value = sess.run([train_op, loss])
    duration = time.time() - start_time

Here is the result for the last step shown above (step 5400, loss = 4.82 (176.0 examples/sec; 0.727 sec/batch)): [screenshot: timeline result]

As you can see, CPU:0 takes a really long time. Expanding the CPU operations: [screenshot: expanded CPU operations]

Concat, Mean, and ApplyAdam() take the most time. They come from the _average_gradients function:

def _average_gradients(tower_grads):
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # Note that each grad_and_vars looks like the following:
        #   ((grad0_gpu0, var0_gpu0), ... , (grad0_gpuN, var0_gpuN))
        grads = []
        for g, _ in grad_and_vars:
            # Add 0 dimension to the gradients to represent the tower.
            expanded_g = tf.expand_dims(g, 0)

            # Append on a 'tower' dimension which we will average over below.
            grads.append(expanded_g)

        # Average over the 'tower' dimension.
        grad = tf.concat(axis=0, values=grads)
        grad = tf.reduce_mean(grad, 0)

        # Keep in mind that the Variables are redundant because they are shared
        # across towers. So .. we will just return the first tower's pointer to
        # the Variable.
        v = grad_and_vars[0][1]
        grad_and_var = (grad, v)
        average_grads.append(grad_and_var)
    return average_grads
...
grads = _average_gradients(tower_grads)
apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

This is reasonable, because the gradients should be averaged after the GPU computation. But how can I improve the performance? I implemented my model by referring to the Inception example in TensorFlow, and I use TensorFlow 1.4.0.

Any advice to improve the training speed?

If any other code, files, or information would help solve this problem, please let me know.

Recommended answer

I tried moving the gradient averaging and the gradient-descent step to GPU:0. Because my GPUs have peer-to-peer connections, the data movement is fast, and the computation on the GPU is also fast. Placing all these ops on the first GPU nearly solved my problem. Other comments are welcome :D
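
A minimal sketch of that placement, assuming the same tower_grads, _average_gradients, opt, and global_step as in the question: wrap the gradient averaging and the optimizer update in a tf.device('/gpu:0') block so these ops land on the first GPU instead of the CPU.

# Sketch: place gradient averaging and the Adam update on /gpu:0 instead of
# the CPU. Assumes tower_grads, _average_gradients, opt and global_step are
# defined as in the question.
with tf.device('/gpu:0'):
    grads = _average_gradients(tower_grads)
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)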
