Using Multiple GPUs outside of training in PyTorch


Question

I'm calculating the accumulated distance between each pair of kernels inside an nn.Conv2d layer. However, for large layers it runs out of memory on a Titan X with 12 GB of memory. I'd like to know if it is possible to divide such a calculation across two GPUs. The code follows:

def ac_distance(layer):
    # Accumulate the distance between every pair of kernels in the layer.
    total = 0
    for p in layer.weight:
        for q in layer.weight:
            total += distance(p, q)
    return total

Where layer is an instance of nn.Conv2d and distance returns the sum of the differences between p and q. I can't detach the graph, however, because I need it later on. I tried wrapping my model in nn.DataParallel, but all calculations in ac_distance are done on only one GPU, even though training uses both.
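For reference, a minimal sketch of what such a distance helper might look like, assuming "sum of the differences" means the sum of absolute element-wise differences between two kernels (the exact definition is not shown in the question):

def distance(p, q):
    # Hypothetical helper; no .detach() is used, so the autograd graph
    # is preserved for later use, as the question requires.
    return (p - q).abs().sum()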

Answer

Parallelism while training neural networks can be achieved in two ways:

  1. Data parallelism - split one large batch into two halves and run the same operations on each half on a different GPU
  2. Model parallelism - split the computation itself and run the pieces on different GPUs

As you have asked in the question, the calculation you would like to split falls into the second category. There are no out-of-the-box ways to achieve model parallelism. PyTorch provides primitives for parallel processing through the torch.distributed package. The torch.distributed tutorial goes through the details of the package comprehensively, and you can cook up an approach to achieve the model parallelism you need.
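As an illustration of the model-parallel route for this specific computation, here is a rough sketch (an assumption, not something from the tutorial) that runs half of the outer loop on a second GPU. It reuses the user's distance function and assumes two GPUs, cuda:0 (where layer.weight lives) and cuda:1; Tensor.to() is a differentiable copy, so gradients still flow back to the original layer.

import torch

def ac_distance_two_gpus(layer):
    w0 = layer.weight                   # original weights, assumed on cuda:0
    w1 = layer.weight.to("cuda:1")      # differentiable copy on cuda:1
    half = w0.shape[0] // 2

    total0 = torch.zeros((), device=w0.device)
    total1 = torch.zeros((), device=w1.device)

    # First half of the outer loop (and its intermediate buffers) on GPU 0 ...
    for p in w0[:half]:
        for q in w0:
            total0 = total0 + distance(p, q)
    # ... second half on GPU 1, spreading the graph's memory over both devices.
    for p in w1[half:]:
        for q in w1:
            total1 = total1 + distance(p, q)

    # Move the partial sum back before combining; this copy is also differentiable.
    return total0 + total1.to(total0.device)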

However, model parallelism can be very complex to achieve. The usual approach is to do data parallelism with either torch.nn.DataParallel or torch.nn.DistributedDataParallel. In both methods you run the same model on two different GPUs, but each large batch is split into two smaller chunks. With DataParallel the gradients are gathered on a single GPU and the optimization step runs there; with DistributedDataParallel, which uses multiprocessing, optimization runs in parallel across the GPUs.
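For context, a minimal DataParallel sketch; the toy Conv2d layer and batch size are placeholders, not taken from the question:

import torch
import torch.nn as nn

# Assumes at least two visible GPUs. The input batch is split along dim 0,
# each chunk runs on its own GPU, and outputs are gathered on the default device.
model = nn.Conv2d(3, 64, kernel_size=3, padding=1)
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

x = torch.randn(32, 3, 224, 224).cuda()   # split as 16 + 16 across the GPUs
out = model(x)                            # forward pass runs on both GPUs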

In your case, if you use DataParallel, the computation will still take place on two different GPUs. If you notice an imbalance in GPU usage, it could be because of the way DataParallel has been designed. You can try DistributedDataParallel instead, which according to the docs is the fastest way to train on multiple GPUs.
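If you go the DistributedDataParallel route, a minimal single-machine sketch looks roughly like this; the worker function, address, port, layer shape, and batch size are illustrative placeholders, not part of the original answer:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU on a single machine with two GPUs.
def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = nn.Conv2d(3, 64, kernel_size=3, padding=1).to(rank)
    model = DDP(model, device_ids=[rank])

    x = torch.randn(8, 3, 224, 224, device=rank)
    model(x).sum().backward()     # gradients are all-reduced across processes

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)   # two processes, one per GPU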

There are other ways to process very large batches too. This article goes through them in detail, and I'm sure it will be helpful. A few important points:

  • Accumulate gradients to simulate larger batches (see the sketch after this list)
  • Use DataParallel
  • If that is still not enough, use DistributedDataParallel
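The gradient-accumulation point can be sketched as follows, assuming a model, optimizer, criterion, and data loader already exist; the step count of 4 is illustrative:

# Simulate a large batch by adding up gradients over several small batches
# before each optimizer step.
accumulation_steps = 4
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = criterion(model(inputs), targets) / accumulation_steps
    loss.backward()                           # .grad buffers accumulate
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()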
