PyTorch multiprocessing error with Hogwild

Problem description

I've encountered a mysterious bug while trying to implement Hogwild with torch.multiprocessing. In particular, one version of the code runs fine, but when I add in a seemingly unrelated bit of code before the multiprocessing step, this somehow causes an error during the multiprocessing step: RuntimeError: Unable to handle autograd's threading in combination with fork-based multiprocessing. See https://github.com/pytorch/pytorch/wiki/Autograd-and-Fork

I reproduced the error in a minimal code sample, pasted below. If I comment out the two lines of code m0 = Model(); train(m0) which carry out a non-parallel training run on a separate model instance, then everything runs fine. I can't figure out how these lines could be causing a problem.

I'm running PyTorch 1.5.1 and Python 3.7.6 on a Linux machine, training on CPU only.

import torch
import torch.multiprocessing as mp
from torch import nn

def train(model):
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(10000):
        opt.zero_grad()
        # We train the model to output the value 4 (arbitrarily)
        loss = (model(0) - 4)**2
        loss.backward()
        opt.step()

# Toy model with one parameter tensor of size 3.
# Output is always the sum of the elements in the tensor,
# independent of the input
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.x = nn.Parameter(torch.ones(3))

    def forward(self, x):
        return torch.sum(self.x)

############################################
# Create a separate Model instance and run
# a non-parallel training run.
# For some reason, this code causes the 
# subsequent parallel run to fail.
m0 = Model()
train(m0)
print('Done with preliminary run')
############################################

num_processes = 2
model = Model()
model.share_memory()
processes = []
for rank in range(num_processes):
    p = mp.Process(target=train, args=(model,))
    p.start()
    processes.append(p)
for p in processes:
    p.join()
    
print(model.x)

Answer

If you modify your code to create new processes like this:

processes = []
ctx = mp.get_context('spawn')  # use the 'spawn' start method instead of the default 'fork'
for rank in range(num_processes):
    p = ctx.Process(target=train, args=(model,))
    p.start()
    processes.append(p)
for p in processes:
    p.join()

it seems to run fine (rest of code same as yours, tested on pytorch 1.5.0 / python 3.6 / NVIDIA T4 GPU).
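
A spelled-out alternative (not part of the original answer, but using the same standard torch.multiprocessing API) is to set the start method globally instead of using a context object; torch.multiprocessing re-exports set_start_method from the standard library, so the rest of the script can keep calling mp.Process directly. A minimal sketch:

import torch.multiprocessing as mp

if __name__ == '__main__':
    # Must run once, before any Process is created anywhere in the program.
    mp.set_start_method('spawn')
    # ... build the model, call model.share_memory(), and start the
    # training processes exactly as in the question ...

Note that with spawn the child processes re-import the script, so module-level code such as the preliminary run in the question should live under the if __name__ == '__main__': guard, and the target function and its arguments must be picklable.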

I'm not completely sure what is carried over from the non-parallel run to the parallel run; I tried creating a completely new model for the two runs (with its own class), deleting everything from the original run, and making sure to delete any tensors and free up memory, and none of that made any difference.

What did make a difference was making sure that .backward() never got called outside of mp.Process() before it was called by a function within mp.Process(). I think what may be carried over is an autograd thread: if that thread already exists before multiprocessing with the default fork method, it fails; if the thread is created only after the fork, it seems to work okay; and if using spawn, it also works okay.
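
If you want to keep the default fork start method, here is a minimal sketch of that same idea (assuming the Model and train definitions from the question, and that nothing else calls .backward() in the parent beforehand): push the preliminary run into its own short-lived child process, so the parent never starts an autograd engine thread before the Hogwild workers are forked.

# Sketch, assuming Model and train are defined as in the question.
# The preliminary run happens in a throwaway child process, so .backward()
# is never called in the parent before the workers are forked. As in the
# original example, the preliminary model's weights are not used afterwards.
prelim = mp.Process(target=train, args=(Model(),))
prelim.start()
prelim.join()
print('Done with preliminary run')

model = Model()
model.share_memory()
workers = [mp.Process(target=train, args=(model,)) for _ in range(2)]  # two Hogwild workers
for p in workers:
    p.start()
for p in workers:
    p.join()
print(model.x)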

Btw: That's a really interesting question - thank you especially for digesting it to a minimal example!
