Ways to implement multi-GPU BN layers with synchronizing means and vars


Problem description

I'd like to know the possible ways to implement batch normalization layers that synchronize batch statistics when training with multiple GPUs.

Caffe: Maybe there are some variants of Caffe that could do this, like this link. But for the BN layer, my understanding is that it still synchronizes only the outputs of layers, not the means and vars. Maybe MPI could synchronize the means and vars, but I think MPI is a little difficult to implement.

Torch: I've seen some comments here and here, which show that running_mean and running_var can be synchronized, but I think the batch mean and batch var cannot be synchronized, or are difficult to synchronize.

Tensorflow: Normally, it is the same as Caffe and Torch. The implementation of BN refers to this. I know TensorFlow can distribute an operation to any device specified by tf.device(), but the computation of the means and vars happens in the middle of the BN layer, so if I gather the means and vars on the CPU, my code will look like this:

# Gather each GPU's conv output so the BN statistics can be computed
# over the full multi-GPU batch on the CPU.
block1_gather = []
label_batches = []
for i in range(num_gpu):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('block1', reuse=i > 0):
            image_batch, label_batch = cifar_input.build_input(
                'cifar10', train_data_path, batch_size, 'train')
            label_batches.append(label_batch)

            x = _conv('weights', image_batch, 3, 3, 16, _stride_arr(1))
            block1_gather.append(x)

with tf.device('/cpu:0'):
    # Concatenate along the batch dimension and compute the batch
    # mean/variance over all GPUs.
    x1 = tf.concat(block1_gather, 0)
    mean, variance = tf.nn.moments(x1, [0, 1, 2], name='moments')

for i in range(num_gpu):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('block2', reuse=i > 0):
            shape = block1_gather[i].get_shape().as_list()
            assert len(shape) in [2, 4]
            n_out = shape[-1]
            beta, gamma, moving_mean, moving_var = get_bn_variables(n_out, True, True)

            # Normalize each GPU's slice with the globally computed statistics.
            x = tf.nn.batch_normalization(
                block1_gather[i], mean, variance, beta, gamma, 0.00001)

            x = _relu(x)

That is just for one BN layer. To gather the statistics on the CPU, I have to break up the code like this. If I have more than 100 BN layers, that will be cumbersome.
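
For illustration, the gather-and-normalize pattern above could be factored into a helper so it does not have to be spelled out for every layer; the name cross_gpu_batch_norm and its signature below are hypothetical, just a sketch of what I mean:

def cross_gpu_batch_norm(tower_inputs, beta, gamma, eps=1e-5):
    # Hypothetical helper: normalize each tower's NHWC tensor with batch
    # statistics computed over the concatenation of all towers.
    with tf.device('/cpu:0'):
        all_x = tf.concat(tower_inputs, 0)
        mean, variance = tf.nn.moments(all_x, [0, 1, 2], name='moments')

    outputs = []
    for i, x in enumerate(tower_inputs):
        with tf.device('/gpu:%d' % i):
            outputs.append(tf.nn.batch_normalization(
                x, mean, variance, beta, gamma, eps))
    return outputs

Even with such a helper, every BN layer still has to hop across device boundaries, which is the part I would like to avoid.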

I am not an expert in those libraries, so there may be some misunderstandings here; feel free to point out my errors.

I do not care much about training speed. I am doing image segmentation, which consumes a lot of GPU memory, and BN needs a reasonable batch size (e.g. larger than 16) for stable statistics, so using multiple GPUs is inevitable. In my opinion, TensorFlow might be the best choice, but I can't resolve the code-breaking problem. Solutions with other libraries are welcome too.

Recommended answer

I'm not sure if I fully understand your question, but provided you set up your variable scope properly, the tf.GraphKeys.UPDATE_OPS collection should automatically have the update ops for batch_norm for each of your towers. If all of the update_ops are applied synchronously, they will be implicitly averaged by the parameter server; all you have to do is make sure the updates are applied before you average and apply the gradients (if I understand your intentions correctly).

Because of the variable scope, each set of update ops will update the same variables, so to synchronize the update ops all you need to do is gate your gradient calculation on the complete set of update ops. You should also encapsulate all of your batch norm layers in a single name_scope to avoid grabbing any extraneous ops in UPDATE_OPS. Code skeleton below:

update_ops = []
for i, device in enumerate(devices):
  with tf.variable_scope('foo', reuse=bool(i > 0)):
    with tf.name_scope('tower_%d' % i) as name_scope:
      with tf.device(device):
        pass  # Put as many batch_norm layers as you want here.
      # Collect this tower's batch-norm update ops, filtered by name_scope.
      update_ops.extend(tf.get_collection(tf.GraphKeys.UPDATE_OPS,
                                          name_scope))
# Make gradient calculation ops here.
with tf.device(averaging_device):
  with tf.control_dependencies(update_ops):
    pass  # Average and apply gradients.

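The two placeholders at the end of the skeleton could be filled in along the following lines; this is only a sketch, and tower_losses, opt (a tf.train.Optimizer), global_step, devices, and averaging_device are assumed to be defined by the surrounding training code:

# Sketch: per-tower gradients, averaged on the averaging device, with the
# apply step gated on all collected batch-norm update ops.
tower_grads = []
for i, device in enumerate(devices):
  with tf.device(device):
    # tower_losses[i] is assumed to be the loss built inside tower i above.
    tower_grads.append(opt.compute_gradients(tower_losses[i]))

with tf.device(averaging_device):
  averaged = []
  for grads_and_vars in zip(*tower_grads):
    # One (grad, var) pair per tower for the same shared variable.
    grad = tf.reduce_mean(tf.stack([g for g, _ in grads_and_vars]), axis=0)
    averaged.append((grad, grads_and_vars[0][1]))
  with tf.control_dependencies(update_ops):
    train_op = opt.apply_gradients(averaged, global_step=global_step)
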
If you want to try this on some existing code, try just deleting the if i == 0 line here: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/cifar10_main.py#L115

You're going to see some slowdown (we usually only use one tower to compute the batch norm statistics for this reason), but it should do what you want.
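
In other words, the change just makes every tower contribute its batch-norm update ops instead of only the first one. Schematically (this is a paraphrase of the pattern, not the literal contents of that line, which may have changed since):

# Hypothetical sketch of the pattern around that line:
if i == 0:  # deleting this condition lets every tower contribute its BN updates
  update_ops.extend(tf.get_collection(tf.GraphKeys.UPDATE_OPS, name_scope))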
