Using Multiple GPUs outside of training in PyTorch
Question
I'm computing the accumulated distance between every pair of kernels inside an nn.Conv2d layer. For large layers, however, this runs out of memory on a Titan X with 12 GB. I'd like to know whether it is possible to divide such a calculation across two GPUs. The code follows:
def ac_distance(layer):
    total = 0
    for p in layer.weight:
        for q in layer.weight:
            total += distance(p, q)
    return total
Here layer is an instance of nn.Conv2d, and distance returns the sum of the differences between p and q. I can't detach the graph, however, because I need it later on. I tried wrapping my model in nn.DataParallel, but all calculations in ac_distance are done using only one GPU, even though training uses both.
Answer
Parallelism while training neural networks can be achieved in two ways:

- Data parallelism - split a large batch into halves and run the same set of operations on each half on a different GPU
- Model parallelism - split the computation itself and run the pieces on different GPUs
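As a rough illustration of the second option applied to the ac_distance calculation above, the O(n²) pairwise sum can be split into row blocks, each evaluated on its own device, without detaching the graph. This is only a sketch: it assumes distance(p, q) is the sum of absolute element-wise differences (the question does not define it), and the device list is a placeholder — pass ("cpu", "cpu") to try it without GPUs.

```python
import torch

def pairwise_abs_sum(a, b):
    # Sum of |p - q| over all pairs of rows, with kernels flattened to rows.
    return (a.unsqueeze(1) - b.unsqueeze(0)).abs().sum()

def ac_distance_split(layer, devices=("cuda:0", "cuda:1")):
    # Split the pairwise sum into two row blocks, one per device, so each
    # device only materializes half of the (n, n, numel) difference tensor.
    w = layer.weight.flatten(1)          # (n_kernels, kernel_numel)
    n = w.shape[0]
    half = n // 2
    total = torch.zeros((), device=devices[0])
    for dev, (lo, hi) in zip(devices, [(0, half), (half, n)]):
        block = pairwise_abs_sum(w[lo:hi].to(dev), w.to(dev))
        total = total + block.to(devices[0])  # .to() keeps the autograd graph
    return total
```

Because .to() is differentiable, the result still carries gradients back to layer.weight, which matches the constraint that the graph cannot be detached.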
As you describe in the question, splitting the calculation falls into the second category. There is no out-of-the-box way to achieve model parallelism. PyTorch provides primitives for parallel processing through the torch.distributed package. This tutorial goes through the package in detail, and you can cook up an approach to achieve the model parallelism you need.
However, model parallelism can be very complex to achieve. The common approach is data parallelism with either torch.nn.DataParallel or torch.nn.DistributedDataParallel. In both methods you run the same model on two different GPUs, but each large batch is split into two smaller chunks. With DataParallel, gradients are accumulated on a single GPU, and optimization happens there; with DistributedDataParallel, optimization is parallelized across GPUs by using multiprocessing.
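A minimal sketch of the DataParallel route (the model here is a stand-in for yours; the wrapper is only applied when more than one GPU is visible, so the snippet also runs on CPU):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU())
if torch.cuda.device_count() > 1:
    # Replicates the model on each visible GPU and splits
    # every incoming batch across the replicas.
    model = nn.DataParallel(model)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(4, 3, 8, 8, device=device)  # one batch, scattered if wrapped
y = model(x)  # outputs are gathered back onto the default device
```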
In your case, if you use DataParallel, the computation would still take place on two different GPUs. If you notice an imbalance in GPU usage, it could be because of the way DataParallel has been designed. You can try DistributedDataParallel, which is the fastest way to train on multiple GPUs according to the docs.
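DistributedDataParallel needs a process group before the model can be wrapped. Normally you launch one process per GPU with torchrun, but a single-process CPU world with the gloo backend is enough to sketch the wiring; the address, port, and model below are placeholders.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Stand-ins for the environment variables torchrun would normally set;
# the port number is chosen arbitrarily for this sketch.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Each process wraps its own replica; gradients are all-reduced
# across processes during backward().
model = DDP(nn.Linear(8, 2))
out = model(torch.randn(4, 8))

dist.destroy_process_group()
```

With real GPUs, each spawned process would pin its replica to one device (device_ids=[rank]) and use the nccl backend instead of gloo.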
There are other ways to process very large batches too. This article goes through them in detail, and I'm sure it will be helpful. A few important points:
- Use gradient accumulation for larger batches
- Use DataParallel
- If that is not enough, use DistributedDataParallel
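The first point can be sketched as follows; the model, data, and accum_steps below are placeholders, and dividing the loss by accum_steps keeps the gradient scale the same as one large batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(16, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # 4 micro-batches of 8 ~ one effective batch of 32

opt.zero_grad()
for step in range(8):
    x = torch.randn(8, 16)
    target = torch.randint(0, 2, (8,))
    loss = F.cross_entropy(model(x), target) / accum_steps
    loss.backward()                 # .grad buffers accumulate across calls
    if (step + 1) % accum_steps == 0:
        opt.step()                  # update once per effective batch
        opt.zero_grad()
```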