Parallelising gradient calculation in Julia


Problem description

I was persuaded some time ago to drop my comfortable Matlab programming and start programming in Julia. I have been working for a long time with neural networks and I thought that, now with Julia, I could get things done faster by parallelising the calculation of the gradient.

The gradient need not be calculated on the entire dataset in one go; instead one can split the calculation. For instance, by splitting the dataset in parts, we can calculate a partial gradient on each part. The total gradient is then calculated by adding up the partial gradients.
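
For instance, a minimal sketch of this decomposition for the least-squares case (toy sizes and illustrative variable names only):

# Minimal illustration: the least-squares gradient over all rows equals the
# sum of the gradients computed on disjoint row chunks.
Xs = randn(8, 3); Ws = randn(3, 1); Ys = Xs * randn(3, 1)
full_grad = Xs' * (Ys - Xs * Ws)
part_grad = Xs[1:4, :]' * (Ys[1:4, :] - Xs[1:4, :] * Ws) +
            Xs[5:8, :]' * (Ys[5:8, :] - Xs[5:8, :] * Ws)
# full_grad and part_grad agree up to floating-point rounding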

Though the principle is simple, when I parallelise with Julia I get a performance degradation, i.e. one process is faster than two processes! I am obviously doing something wrong... I have consulted other questions asked in the forum but I could still not piece together an answer. I think my problem lies in that there is a lot of unnecessary data movement going on, but I can't fix it properly.

In order to avoid posting messy neural network code, I am posting below a simpler example that replicates my problem in the setting of linear regression.

The code-block below creates some data for a linear regression problem. The code explains the constants, but X is the matrix containing the data inputs. We randomly create a weight vector w which when multiplied with X creates some targets Y.

######################################
## CREATE LINEAR REGRESSION PROBLEM ##
######################################

# This code implements a simple linear regression problem

MAXITER = 100   # number of iterations for simple gradient descent
N = 10000       # number of data items
D = 50          # dimension of data items
X = randn(N, D) # create random matrix of data, data items appear row-wise
Wtrue = randn(D,1) # create arbitrary weight matrix to generate targets
Y = X*Wtrue     # generate targets

The next code-block below defines functions for measuring the fitness of our regression (i.e. the negative log-likelihood) and the gradient of the weight vector w:
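
Concretely, in the notation of the code below, the two functions compute ll(W) = -0.5 * Σₙ (yₙ - xₙW)² and its gradient Σₙ xₙ' (yₙ - xₙW), where xₙ denotes the n-th row of X and yₙ the n-th target.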

####################################
##       DEFINE FUNCTIONS         ##
####################################

@everywhere  begin

  #-------------------------------------------------------------------
  function negative_loglikelihood(Y,X,W)
  #-------------------------------------------------------------------

    # number of data items
    N  = size(X,1)
    # accumulate here log-likelihood
    ll = 0
    for nn=1:N
      ll = ll - 0.5*sum((Y[nn,:] - X[nn,:]*W).^2)
    end

    return ll
  end


  #-------------------------------------------------------------------
  function negative_loglikelihood_grad(Y,X,W, first_index,last_index)
  #-------------------------------------------------------------------

    # number of data items
    N  = size(X,1)
    # accumulate here gradient contributions by each data item
    grad = zeros(similar(W))
    for nn=first_index:last_index
      grad = grad +  X[nn,:]' * (Y[nn,:] - X[nn,:]*W)
    end

    return grad
  end


end

Note that the above functions are on purpose not vectorised! I chose not to vectorise, as the final code (the neural network case) will also not admit any vectorisation (let us not get into more details regarding this).

Finally, the code-block below shows a very simple gradient descent that tries to recover the parameter weight vector w from the given data Y and X:

####################################
##     SOLVE LINEAR REGRESSION    ##
####################################


# start from random initial solution
W = randn(D,1)

# learning rate, set here to some arbitrary small constant
eta = 0.000001

# the following for-loop implements simple gradient descent
for iter=1:MAXITER

  # get gradient
  ref_array = Array(RemoteRef, nworkers())

  # let each worker process part of matrix X
  for index=1:length(workers())

    # first index of subset of X that worker should work on
    first_index       = (index-1)*int(ceil(N/nworkers())) + 1
    # last index of subset of X that worker should work on
    last_index        = min((index)*(int(ceil(N/nworkers()))), N)

    ref_array[index] = @spawn negative_loglikelihood_grad(Y,X,W, first_index,last_index)
  end

  # gather the gradients calculated on parts of matrix X
  grad = zeros(similar(W))
  for index=1:length(workers())
    grad = grad + fetch(ref_array[index])
  end

  # now that we have the gradient we can update parameters W
  W = W + eta*grad;

  # report progress, monitor optimisation
  @printf("Iter %d neg_loglikel=%.4f\n",iter, negative_loglikelihood(Y,X,W))
end

As is hopefully visible, I tried to parallelise the calculation of the gradient in the easiest possible way here. My strategy is to break the calculation of the gradient into as many parts as there are available workers. Each worker is required to work only on a part of matrix X, the part specified by first_index and last_index. Hence, each worker should work with X[first_index:last_index,:]. For instance, for 4 workers and N = 10000, the work should be divided as follows:

  • Worker 1 => first_index = 1, last_index = 2500
  • Worker 2 => first_index = 2501, last_index = 5000
  • Worker 3 => first_index = 5001, last_index = 7500
  • Worker 4 => first_index = 7501, last_index = 10000
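
A quick stand-alone check of this arithmetic (illustrative; hard-coded for 4 workers and N = 10000, using the same int(ceil(...)) expression as the loop above):

# Illustrative check: reproduce the worker ranges listed above
# (assumes 4 workers and N = 10000).
chunk = int(ceil(10000/4))
for index = 1:4
    println(index, ": ", (index-1)*chunk + 1, " - ", min(index*chunk, 10000))
end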

Unfortunately, this entire code works faster if I have only one worker. If I add more workers via addprocs(), the code runs slower. One can aggravate this issue by creating more data items, for instance by using N=20000 instead. With more data items, the degradation is even more pronounced. In my particular computing environment with N=20000 and one core, the code runs in ~9 secs. With N=20000 and 4 cores it takes ~18 secs!

I tried many different things inspired by the questions and answers in this forum but unfortunately to no avail. I realise that the parallelisation is naive and that data movement must be the problem, but I have no idea how to do it properly. It seems that the documentation is also a bit scarce on this issue (as is the nice book by Ivo Balbaert).
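
For what it's worth, one quick way to see the cost of the data movement (a rough diagnostic using the functions defined above) is to time one local gradient call against one remote call, which has to serialise Y, X and W to the worker:

# Rough diagnostic: the remote call pays for shipping Y, X and W to a worker
# (requires at least one worker added via addprocs()).
@time negative_loglikelihood_grad(Y, X, W, 1, N)               # local evaluation
@time fetch(@spawn negative_loglikelihood_grad(Y, X, W, 1, N)) # remote evaluation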

I would appreciate your help, as I have been stuck for quite some while with this and I really need it for my work. For anyone wanting to run the code, to save you the trouble of copy-pasting, you can get the code here.

Thanks for taking the time to read this very lengthy question! Help me turn this into a model answer that anyone new in Julia can then consult!

Answer

I would say that GD is not a good candidate for parallelizing with any of the proposed methods: either SharedArray or DistributedArray, or your own implementation of distributing chunks of data.

The problem does not lie in Julia, but in the GD algorithm. Consider the code:

The main process:

for iter = 1:iterations #iterations: "the more the better"
    δ = _gradient_descent_shared(X, y, θ) 
    θ = θ -  α * (δ/N)   
end

The problem is in the above for-loop, which is unavoidable. No matter how good _gradient_descent_shared is, the total number of iterations kills the noble concept of parallelization.
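
A rough way to see the fixed cost that every one of those iterations pays is to time a no-op round trip to each worker (machine-dependent, illustrative only):

# Even a no-op round trip to every worker takes time; plain GD pays this
# scheduling/fetch overhead on every single iteration.
@time for p in workers()
    fetch(@spawnat p nothing)
end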

After reading the question and the above suggestion I've started implementing GD using SharedArray. Please note, I'm not an expert in the field of SharedArrays.

The main process parts (simple implementation without regularization):

run_gradient_descent(X::SharedArray, y::SharedArray, θ::SharedArray, α, iterations) = begin
  N = length(y)

  for iter = 1:iterations 
    δ = _gradient_descent_shared(X, y, θ) 
    θ = θ -  α * (δ/N)   
  end     
  θ
end

_gradient_descent_shared(X::SharedArray, y::SharedArray, θ::SharedArray, op=(+)) = begin            
    if size(X,1) <= length(procs(X))
        return _gradient_descent_serial(X, y, θ)
    else
        rrefs = map(p -> (@spawnat p _gradient_descent_serial(X, y, θ)), procs(X))
        return mapreduce(r -> fetch(r), op, rrefs)
    end 
end

The code common to all workers:

#= Returns the range of indices of a chunk for every worker on which it can work. 
The function splits data examples (N rows into chunks), 
not the parts of the particular example (features dimensionality remains intact).=# 
@everywhere function _worker_range(S::SharedArray)
    idx = indexpids(S)
    if idx == 0
        return 1:size(S,1), 1:size(S,2)
    end
    nchunks = length(procs(S))
    splits = [round(Int, s) for s in linspace(0,size(S,1),nchunks+1)]
    splits[idx]+1:splits[idx+1], 1:size(S,2)
end

#Computations on the chunk of the all data.
@everywhere _gradient_descent_serial(X::SharedArray, y::SharedArray, θ::SharedArray)  = begin
    prange = _worker_range(X)

    pX = sdata(X[prange[1], prange[2]])
    py = sdata(y[prange[1],:])

    tempδ = pX' * (pX * sdata(θ) .- py)         
end
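
For example (assuming four worker processes mapping a SharedArray S with size(S,1) == 10000), _worker_range would hand the workers the row ranges 1:2500, 2501:5000, 5001:7500 and 7501:10000, each together with the full column range 1:size(S,2).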

The data loading and training. Let me assume that we have:

  • features in X::Array of size (N, D), where N is the number of examples and D is the dimensionality of the features
  • labels in y::Array of size (N, 1)

The main code looks like this:

X=[ones(size(X,1)) X] #adding the artificial coordinate 
N, D = size(X)
MAXITER = 500
α = 0.01

initialθ = SharedArray(Float64, (D,1))
sX = convert(SharedArray, X)
sy = convert(SharedArray, y)  
X = nothing
y = nothing
gc() 

finalθ = run_gradient_descent(sX, sy, initialθ, α, MAXITER);

After implementing this and running it (on 8 cores of my Intel Core i7), I got a very slight acceleration over serial GD (1 core) on my multiclass (19 classes) training data (715 sec for serial GD / 665 sec for shared GD).

If my implementation is correct (please check this out - I'm counting on that), then parallelizing the GD algorithm is not worth it. You may well get better acceleration using stochastic GD on a single core.
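
To make that last suggestion concrete, a minimal single-core stochastic-GD sketch for the same linear model could look like this (the function name sgd and its signature are illustrative, not part of the implementation above):

# Minimal single-core stochastic gradient descent sketch (illustrative):
# update θ from one randomly chosen example per step instead of the full batch.
function sgd(X, y, θ, α, iterations)
    for iter = 1:iterations
        n  = rand(1:size(X,1))      # pick one example at random
        xn = X[n:n, :]              # its feature row, kept as a 1×D matrix
        δ  = xn' * (xn * θ .- y[n:n, :])
        θ  = θ - α * δ
    end
    θ
end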
