Parallelization strategies for deep learning


Problem description



    What strategies and forms of parallelization are feasible and available for training and serving a neural network?:

    • inside a machine across cores (e.g. GPU / TPU / CPU)
    • across machines on a network or a rack

    I'm also looking for evidence for how they may also be used in e.g. TensorFlow, PyTorch or MXNet.

    Training

    To my knowledge, when training large neural networks on large datasets, one could at least have:

    1. Different cores or machines operate on different parts of the graph ("graph splitting"). E.g. backpropagation through the graph itself can be parallelized e.g. by having different layers hosted on different machines since (I think?) the autodiff graph is always a DAG.
    2. Different cores or machines operate on different samples of data ("data splitting"). In SGD, the computation of gradients across batches or samples can also be parallelized (e.g. the gradients can be combined after computing them independently on different batches). I believe this is also called gradient accumulation (?).

    When is each strategy better for what type of problem or neural network? Which modes are supported by modern libraries? and can one combine all four (2x2) strategies?

    On top of that, I have read about:

    • Asynchronous training
    • Synchronous training

    but I don't know what exactly that refers to, e.g. is it the computation of gradients on different data batches or the computation of gradients on different subgraphs? Or perhaps it refers to something else altogether?

    Serving

    If the network is huge, prediction / inference may also be slow, and the model may not fit on a single machine in memory at serving time. Are there any known multi-core and multi-node prediction solutions that work that can handle such models?

    Solution

    As the question is quite broad, I'll try to shed a little different light and touch on different topics than what was shown in @Daniel's in-depth answer.

    Training

    Data parallelization vs model parallelization

    As mentioned by @Daniel, data parallelism is used way more often and is easier to do correctly. The major caveat of model parallelism is the need to wait for other parts of the neural network and the synchronization between them.

    Say you have a simple feedforward 5-layer neural network spread across 5 different GPUs, each layer on one device. In this case, during each forward pass each device has to wait for the computations from the previous layers. In this simplistic case, copying data between devices and synchronizing would take a lot longer than the computation itself and wouldn't bring any benefit.
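    To make the synchronization cost concrete, below is a minimal PyTorch sketch of such a layer-wise split across two GPUs (the device names and layer sizes are made up); every forward pass has to copy activations from one device to the other before the next part can start:

```python
import torch
import torch.nn as nn

class TwoDeviceNet(nn.Module):
    """Toy model split across two GPUs; requires at least 2 CUDA devices."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Synchronization point: activations are copied to the second device,
        # which has to wait for the first one to finish.
        return self.part2(x.to("cuda:1"))

model = TwoDeviceNet()
out = model(torch.randn(32, 1024))
```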

    On the other hand, there are models better suited for model parallelization, like Inception networks.

    In an Inception block there are 4 independent paths from the previous layer which could go in parallel, and only 2 synchronization points (Filter concatenation and Previous Layer).

    Questions

    E.g. backpropagation through the graph itself can be parallelized e.g. by having different layers hosted on different machines since (I think?) the autodiff graph is always a DAG.

    It's not that easy. Gradients are calculated based on the loss value (usually) and you need to know gradients of deeper layers to calculate gradients for the more shallow ones. As above, if you have independent paths it's easier and may help, but it's way easier on a single device.

    I believe this is also called gradient accumulation (?)

    No, it's actually reduction across multiple devices. You can see some of that in the PyTorch tutorial. Gradient accumulation is when you run your forward pass (either on a single device or on multiple devices) N times and backpropagate each time (the gradients are kept in the parameters' .grad buffers and summed across passes), and the optimizer only makes a single step to change the neural network's weights (and then clears the gradients). In this case, the loss is usually divided by the number of accumulation steps between optimizer steps. This is used for a more reliable gradient estimate, usually when you are unable to use large batches.
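    For contrast, here is a minimal sketch of gradient accumulation on a single device (the toy model, random data and step counts are only for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
accum_steps = 4

optimizer.zero_grad()
for step in range(8):
    x, y = torch.randn(16, 10), torch.randn(16, 1)
    loss = criterion(model(x), y) / accum_steps   # scale so the summed grads match one big batch
    loss.backward()                               # gradients are summed into .grad on every pass
    if (step + 1) % accum_steps == 0:
        optimizer.step()                          # one weight update per accum_steps mini-batches
        optimizer.zero_grad()
```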

    Reduction across devices is the all-reduce pattern of data parallelization: each device calculates its values, which are sent to all other devices and backpropagated there.
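    A minimal sketch of that all-reduce step with torch.distributed (assuming the script is launched once per device, e.g. with torchrun --nproc_per_node=2, and using the gloo backend for simplicity):

```python
import torch
import torch.distributed as dist

# Each process computes its own gradient, and all of them end up with the average.
dist.init_process_group(backend="gloo")            # torchrun provides rank/world size via env vars
local_grad = torch.randn(10)                       # stand-in for the gradient from this replica's batch
dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)  # sum of gradients from all replicas
local_grad /= dist.get_world_size()                # every replica now holds the averaged gradient
dist.destroy_process_group()
```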

    When is each strategy better for what type of problem or neural network?

    Described above: data parallelism is almost always fine if you have enough data and the samples are big (up to 8k samples or more can be done at once without a very big struggle).

    Which modes are supported by modern libraries?

    tensorflow and pytorch both support either; most modern and maintained libraries have those functionalities implemented one way or another

    can one combine all four (2x2) strategies

    Yes, you can parallelize both model and data across and within machines.

    synchronous vs asynchronous

    asynchronous

    Described by @Daniel in brief, but it's worth mentioning updates are not totally separate. That would make little sense, as we would essentially train N different models based on their batches.

    Instead, there is a global parameter space, where each replica is supposed to share its calculated updates asynchronously (so: forward pass, backward pass, calculate the update with the optimizer and push this update to the global params).

    This approach has one problem though: there is no guarantee that another worker hasn't updated the parameters while one worker was calculating its forward pass, so the update is calculated with respect to an old set of params; this is called stale gradients. Due to this, convergence might be hurt.
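    A toy, networking-free illustration of how a stale gradient arises (all values, the learning rate and the "workers" here are made up):

```python
import torch

params = torch.tensor([1.0, 1.0])            # global parameter state
lr = 0.5

snapshot = params.clone()                    # worker B reads the params and starts computing

params -= lr * torch.tensor([0.2, 0.2])      # meanwhile worker A pushes its update first

grad_b = 2 * snapshot                        # worker B's gradient was computed w.r.t. the old snapshot...
params -= lr * grad_b                        # ...but is applied to already-updated params: a stale gradient
```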

    Another approach is to have each worker calculate N steps and updates and synchronize them afterwards, though it's not used as often.

    This part was based on a great blogpost which you should definitely read if interested (there is more about staleness and some solutions).

    synchronous

    Mostly described previously; there are different approaches, but PyTorch gathers outputs from the network replicas and backpropagates on them ([torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel)). BTW, you should use solely this one (not torch.nn.DataParallel) as it overcomes Python's GIL problem.
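    A minimal DistributedDataParallel sketch (one process per GPU, launched e.g. with torchrun --nproc_per_node=<num_gpus>; the toy model, data and hyperparameters are purely illustrative):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")       # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
torch.cuda.set_device(local_rank)

model = DDP(nn.Linear(10, 1).to(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 10, device=local_rank)
y = torch.randn(16, 1, device=local_rank)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()                               # gradients are all-reduced across processes here
optimizer.step()
dist.destroy_process_group()
```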

    Takeaways

    • Data parallelization is almost always used when going for a speed-up, as you "only" have to replicate the neural network on each device (either over the network or within a single machine), run part of the batch on each during the forward pass, concatenate them into a single batch (synchronization) on one device and backpropagate on said device.
    • There are multiple ways to do data parallelization, already introduced by @Daniel
    • Model parallelization is done when the model is too large to fit on single machine (OpenAI's GPT-3 would be an extreme case) or when the architecture is suited for this task, but both are rarely the case AFAIK.
    • The more and the longer the parallel paths a model has between synchronization points, the better suited it might be for model parallelization
    • It's important to start workers at similar times with similar loads in order not to wait for synchronization in the synchronous approach, or not to end up with stale gradients in the asynchronous one (though in the latter case it's not enough).

    Serving

    Small models

    As you are after large models I won't delve into options for smaller ones, just a brief mention.

    If you want to serve multiple users over the network you need some way to scale your architecture (usually a cloud like GCP or AWS). You could do that using Kubernetes and its Pods, or pre-allocate some servers to handle requests, but that approach would be inefficient (a small number of users and running servers would generate pointless costs, while large numbers of users may halt the infrastructure and take too long to process requests).

    Another way is to use autoscaling based on a serverless approach. Resources will be provided per request, so it has large scaling abilities, plus you don't pay when the traffic is low. You can look at Azure Functions, as they are on the path to improving it for ML/DL tasks, or torchlambda for PyTorch (disclaimer, I'm the author) for smaller models.

    Large models

    As mentioned previously, you could use Kubernetes with your custom code or ready-to-use tools.

    In the first case, you can spread the model just the same as for training, but only do the forward pass. In this way even giant models can be put up on the network (once again, GPT-3 with 175B parameters), but it requires a lot of work.
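    A minimal sketch of that idea: the model is split across two GPUs and only the forward pass is run at serving time (layer sizes and device names are made up):

```python
import torch
import torch.nn as nn

part1 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to("cuda:0").eval()
part2 = nn.Linear(512, 10).to("cuda:1").eval()

with torch.no_grad():                             # inference only, no gradients kept
    x = torch.randn(1, 1024, device="cuda:0")
    prediction = part2(part1(x).to("cuda:1"))     # activations hop between devices, no backward pass
```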

    In the second case, @Daniel provided two possibilities. Others worth mentioning could be (read the respective docs, as those have a lot of functionality):

    • KubeFlow - multiple frameworks, based on Kubernetes (so auto-scaling, multi-node), training, serving and what not, connects with other things like MLFlow below
    • AWS SageMaker - training and serving with Python API, supported by Amazon
    • MLFlow - multiple frameworks, for experiment handling and serving
    • BentoML - multiple frameworks, training and serving

    For PyTorch, you could read more here, while tensorflow has a lot of serving functionality out of the box via TensorFlow Extended (TFX).

    Questions from OP's comment

    Are there any forms of parallelism that are better within a machine vs across machines

    The best form of parallelism would probably be within one giant computer, so as to minimize transfer between devices.

    Additionally, there are different backends (at least in PyTorch) one can choose from (mpi, gloo, nccl), and not all of them support direct sending, receiving, reducing etc. of data between devices (some may support CPU to CPU, others GPU to GPU). If there is no direct link between devices, the data has to be copied to an intermediate device first and then copied again to the target device (e.g. GPU on another machine -> CPU on host -> GPU on host). See the pytorch info.
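    A small sketch of how the backend choice shows up in code (gloo with CPU tensors here; run one process per rank, e.g. via torchrun --nproc_per_node=2; whether GPU tensors can be passed directly depends on the backend):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")   # "nccl" would require GPU tensors, "mpi" a suitable build
if dist.get_rank() == 0:
    dist.send(torch.zeros(4), dst=1)      # direct point-to-point send of a CPU tensor
else:
    buf = torch.empty(4)
    dist.recv(buf, src=0)
dist.destroy_process_group()
```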

    The more data and the bigger the network, the more profitable it should be to parallelize computations. If the whole dataset fits on a single device there is no need for parallelization. Additionally, one should take into account things like internet transfer speed, network reliability etc. Those costs may outweigh the benefits.

    In general, go for data parallelization if you have lots of data (say ImageNet with 1.000.000 images) or big samples (say 2000x2000 images). If possible, stay within a single machine so as to minimize between-machine transfer. Distribute the model only if there is no way around it (e.g. it doesn't fit on the GPU); don't otherwise (there is little to no point in parallelizing when training MNIST, as the whole dataset easily fits in RAM and reads from it will be fastest).

    Why bother building custom ML-specific hardware such as TPUs?

    CPUs are not the best suited for highly parallel computations (e.g. matrix multiplication), plus the CPU may be occupied with many other tasks (like data loading), hence it makes sense to use a GPU.

    As the GPU was created with graphics in mind (so algebraic transformations), it can take over some CPU duties and can be specialized (many more cores compared to a CPU, but simpler ones; see the V100 for example).

    Now, TPUs are tailored specifically for tensor computations (so deep learning mainly) and originated at Google; they are still a WIP when compared to GPUs. They are suited for certain types of models (mainly convolutional neural networks) and can bring speedups in this case. Additionally, one should use the largest batches with this device (see here), ideally divisible by 128. You can compare that to NVidia's Tensor Cores technology (GPU), where you are fine with batches (or layer sizes) divisible by 16 or 8 (float16 and int8 precision respectively) for good utilization (although the more the better, and it also depends on the number of cores, the exact graphics card and many other things; see some guidelines here).

    On the other hand, TPU support still isn't the best, although two major frameworks support it (tensorflow officially, while PyTorch via the torch_xla package).
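    For completeness, a minimal torch_xla sketch (it only runs where the torch_xla package and an XLA/TPU device are available; the layer size is arbitrary):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                            # the TPU (XLA) device
model = torch.nn.Linear(128, 128).to(device)
out = model(torch.randn(128, 128, device=device))
xm.mark_step()                                      # ask XLA to compile and run the queued graph
```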

    In general, the GPU is a good default choice in deep learning right now, and TPUs are for convolution-heavy architectures, though they might give you some headache tbh. Also (once again thanks @Daniel), TPUs are more power efficient, hence should be cheaper when comparing the cost of a single floating point operation.
