Tensorflow Inception Multiple GPU Training Loss is not Summed?

Question

I am trying to work through TensorFlow's Inception code for multiple GPUs (on one machine). As I understand it, we get one loss per tower (i.e., per GPU), but the loss variable that gets evaluated seems to be only the last tower's loss, not a sum of the losses from all towers:

for step in xrange(FLAGS.max_steps):
  start_time = time.time()
  _, loss_value = sess.run([train_op, loss])
  duration = time.time() - start_time

Where loss was last defined specifically for each tower:

for i in xrange(FLAGS.num_gpus):
  with tf.device('/gpu:%d' % i):
    with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
      # Force all Variables to reside on the CPU.
      with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
        # Calculate the loss for one tower of the ImageNet model. This
        # function constructs the entire ImageNet model but shares the
        # variables across all towers.
        loss = _tower_loss(images_splits[i], labels_splits[i], num_classes,
                           scope)

Could someone explain where the step is that combines the losses from the different towers? Or are we simply using a single tower's loss as representative of the other towers' losses as well?

Here is a link to the code: https://github.com/tensorflow/models/blob/master/inception/inception/inception_train.py#L336

Answer

Yes, according to this code, the losses are not summed or averaged across GPUs. Each GPU (tower) uses its own loss for its gradient calculation; only the gradients are synchronized. So the isnan test is only performed on the portion of the data processed by the last GPU. This is not crucial, but it can be a limitation.
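For context, here is a simplified sketch of how the original script synchronizes the towers: each tower computes gradients from its own loss, and only those gradients are averaged on the CPU before being applied. This is adapted and condensed from the average_gradients() helper in inception_train.py:

def average_gradients(tower_grads):
  # tower_grads: a list (one entry per tower) of lists of (gradient, variable) pairs.
  average_grads = []
  for grad_and_vars in zip(*tower_grads):
    # grad_and_vars holds the same variable's gradient from every tower:
    # ((grad0_gpu0, var0_gpu0), ..., (grad0_gpuN, var0_gpuN))
    grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
    grad = tf.reduce_mean(tf.concat(grads, 0), 0)
    # Variables are shared across towers, so the first tower's variable suffices.
    average_grads.append((grad, grad_and_vars[0][1]))
  return average_grads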

If you really need it, I think you can do the following to get the loss averaged across GPUs:

per_gpu_loss = []
for i in xrange(FLAGS.num_gpus):
    with tf.device('/gpu:%d' % i):
        with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
            ...
            # Collect each tower's loss instead of overwriting it.
            per_gpu_loss.append(loss)

# Average the per-tower losses and export the result to TensorBoard.
mean_loss = tf.reduce_mean(per_gpu_loss, name="mean_loss")
tf.summary.scalar('mean_loss', mean_loss)

and then replace loss with mean_loss in the sess.run call:

_, loss_value = sess.run([train_op, mean_loss])

loss_value is now the average of the losses across all the GPUs.
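As a side benefit, the isnan check that the training loop runs right after sess.run now guards data from every tower, not just the last one. A minimal sketch (assuming np is numpy, as in the original script):

_, loss_value = sess.run([train_op, mean_loss])
# loss_value reflects all towers, so a NaN in any tower is caught here.
assert not np.isnan(loss_value), 'Model diverged with loss = NaN'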
