Parallelizing ML models which rely on stochastic gradient descent?

Question

I have been reading about different NLP models like word2vec and GloVe, and how these can be parallelized because they are mostly just dot products. However, I am a bit confused by this, because computing the gradient and updating the model depend on the current values of the parameters/vectors. How is this done in parallel/asynchronously? How do you know when to update the global parameters using the gradients being computed stochastically by each of the threads?

Answer

Generally, doing everything approximately and with some lags/drift between nodes doesn't hurt that much. Two of the key early papers were:

"HOGWILD!:并行化的无锁方法随机梯度下降

"HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent"

by Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu

https://papers.nips.cc/paper/2011/hash/218a0aefd1d1a4be65601cc6ddc1520e-Abstract.html

ABSTRACT: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.
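
To make the update pattern concrete, below is a minimal single-process sketch of HOGWILD!-style lock-free SGD on a toy sparse least-squares problem. It is not the paper's implementation: the names (sgd_worker, nz_idx, the learning rate, etc.) are made up for this illustration, and in CPython the GIL serializes the Python-level arithmetic, so the sketch shows the access pattern (stale reads, unsynchronized sparse writes to a shared parameter vector) rather than a speedup.

    import threading
    import numpy as np

    # Toy sparse least-squares problem: each sample touches only a few
    # features, which is the sparsity assumption HOGWILD! relies on.
    n_features, n_samples, n_threads = 1000, 10000, 4
    rng = np.random.default_rng(0)
    nz_idx = [rng.choice(n_features, size=10, replace=False) for _ in range(n_samples)]
    nz_val = [rng.normal(size=10) for _ in range(n_samples)]
    true_w = rng.normal(size=n_features)
    y = np.array([v @ true_w[i] for i, v in zip(nz_idx, nz_val)])

    w = np.zeros(n_features)  # shared parameters, updated by all threads without a lock

    def sgd_worker(sample_ids, lr=0.05, epochs=5):
        for _ in range(epochs):
            for s in sample_ids:
                idx, x = nz_idx[s], nz_val[s]
                # Read possibly-stale parameters, compute a sparse gradient,
                # and write the update back with no synchronization at all.
                err = x @ w[idx] - y[s]
                w[idx] -= lr * err * x

    chunks = np.array_split(rng.permutation(n_samples), n_threads)
    threads = [threading.Thread(target=sgd_worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("mean squared parameter error:", np.mean((w - true_w) ** 2))

Because each update only touches a handful of coordinates, two threads rarely write to the same entry at the same time, which is why the occasional overwritten update does not derail convergence.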

"Large Scale Distributed Deep Networks"

by Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc Le, Andrew Ng

https://papers.nips.cc/paper/2012/hash/6aca97005c68f1206823815f66102863-Abstract.html

Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 100x larger than previously reported in the literature, and achieved state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.
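
Downpour SGD is organized around a parameter server: many model replicas repeatedly pull a (possibly stale) copy of the global parameters, compute gradients on their own shard of data, and push those gradients back asynchronously, with the server applying each update as it arrives. The sketch below imitates that loop in one process on a toy linear-regression problem; the ParameterServer class, worker_loop function, and all hyperparameters are assumptions made for this illustration, not part of DistBelief.

    import threading
    import numpy as np

    class ParameterServer:
        """Holds the global parameters and applies gradients as they arrive."""
        def __init__(self, dim, lr=0.01):
            self.w = np.zeros(dim)
            self.lr = lr
            self.lock = threading.Lock()  # one coarse lock for clarity; a real server shards the parameters

        def pull(self):
            with self.lock:
                return self.w.copy()      # replicas get a possibly-stale snapshot

        def push(self, grad):
            with self.lock:
                self.w -= self.lr * grad  # applied whenever a replica reports in

    def worker_loop(ps, X, y, steps=500, batch=32):
        rng = np.random.default_rng()
        for _ in range(steps):
            w = ps.pull()                     # fetch current (stale) parameters
            i = rng.integers(0, len(y), size=batch)
            err = X[i] @ w - y[i]
            grad = X[i].T @ err / batch       # local stochastic gradient on this replica's batch
            ps.push(grad)                     # send it back without waiting for other replicas

    dim, n = 50, 5000
    rng = np.random.default_rng(1)
    X = rng.normal(size=(n, dim))
    true_w = rng.normal(size=dim)
    y = X @ true_w

    ps = ParameterServer(dim)
    workers = [threading.Thread(target=worker_loop, args=(ps, X, y)) for _ in range(4)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    print("parameter error:", np.linalg.norm(ps.w - true_w))

The key point, matching the question, is that no replica waits for the others: each one computes its gradient from whatever parameter values it last pulled, and the server simply folds in updates in the order they arrive.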
