BLAS v. parallel updates for Julia SharedArray objects

Problem description

I am interested in using Julia SharedArrays for a scientific computing project. My current implementation appeals to BLAS for all matrix-vector operations, but I thought that perhaps a SharedArray would offer some speedup on multicore machines. My idea is to simply update an output vector index-by-index, farming the index updates to worker processes.

Previous discussions here about SharedArrays and here about shared memory objects did not offer clear guidance on this issue. It seems intuitively simple enough, but after testing, I'm somewhat confused as to why this approach works so poorly (see code below). For starters, it seems like @parallel for allocates a lot of memory. And if I prefix the loop with @sync, which seems like a smart thing to do if the whole output vector is required later, then the parallel loop is substantially slower (though without @sync, the loop is mighty quick).

Have I incorrectly interpreted the proper use of the SharedArray object? Or perhaps did I inefficiently assign the calculations?

### test for speed gain w/ SharedArray vs. Array ###

# problem dimensions
n = 10000; p = 25000

# set BLAS threads; 64 seems reasonable in testing
blas_set_num_threads(64)

# make normal Arrays
x = randn(n,p)
y = ones(p)
z = zeros(n)

# make SharedArrays
X = convert(SharedArray{Float64,2}, x)  
Y = convert(SharedArray{Float64,1}, y)  
Z = convert(SharedArray{Float64,1}, z)  

# run BLAS.gemv! on Arrays twice, time second case
BLAS.gemv!('N', 1.0, x, y, 0.0, z)
@time BLAS.gemv!('N', 1.0, x, y, 0.0, z)

# does BLAS work equally well for SharedArrays? 
# check timing result and ensure same answer
BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
@time BLAS.gemv!('N', 1.0, X, Y, 0.0, Z)
println("$(isequal(z,Z))")  # should be true

# SharedArrays can be updated in parallel
# code a loop to farm updates to worker nodes
# use transposed X to place rows of X in columnar format
# should (hopefully) help with performance issues from stride
Xt = X'  
@parallel for i = 1:n 
    Z[i] = dot(Y, Xt[:,i])
end

# now time the synchronized copy of this
@time @sync @parallel for i = 1:n 
    Z[i] = dot(Y, Xt[:,i])
end

# still get same result?
println("$(isequal(z,Z))")  # should be true

Output from test with 4 workers + 1 master node:

elapsed time: 0.109010169 seconds (80 bytes allocated)
elapsed time: 0.110858551 seconds (80 bytes allocated)
true
elapsed time: 1.726231048 seconds (119936 bytes allocated)
true

Solution

You're running into several issues, of which the most important is that Xt[:,i] creates a new array (allocating memory). Here's a demonstration that gets you closer to what you want:

n = 10000; p = 25000

# make normal Arrays
x = randn(n,p)
y = ones(p)
z = zeros(n)

# make SharedArrays
X = convert(SharedArray, x)  
Y = convert(SharedArray, y)  
Z = convert(SharedArray, z)

Xt = X'

@everywhere function dotcol(a, B, j)
    length(a) == size(B,1) || throw(DimensionMismatch("a and B must have the same number of rows"))
    s = 0.0
    @inbounds @simd for i = 1:length(a)
        s += a[i]*B[i,j]
    end
    s
end

function run1!(Z, Y, Xt)
    for j = 1:size(Xt, 2)
        Z[j] = dotcol(Y, Xt, j)
    end
    Z
end

function runp!(Z, Y, Xt)
    @sync @parallel for j = 1:size(Xt, 2)
        Z[j] = dotcol(Y, Xt, j)
    end
    Z
end

run1!(Z, Y, Xt)
runp!(Z, Y, Xt)
@time run1!(Z, Y, Xt)
zc = copy(sdata(Z))
fill!(Z, -1)
@time runp!(Z, Y, Xt)

@show sdata(Z) == zc

Results (when starting julia -p 8):

julia> include("/tmp/paralleldot.jl")
elapsed time: 0.465755791 seconds (80 bytes allocated)
elapsed time: 0.076751406 seconds (282 kB allocated)
sdata(Z) == zc = true
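
The allocation issue named at the start of this answer can be checked in isolation. The snippet below is a minimal sketch in current Julia (1.x) syntax, not part of the original answer; the helper names `copy_dot` and `view_dot` and the small problem size are assumptions made for illustration. It compares a slicing expression such as `Xt[:, i]`, which copies the column, with `view(Xt, :, i)`, which refers to the existing memory.

```julia
using LinearAlgebra

# Small illustrative problem; sizes are arbitrary (assumption).
Xt = randn(1000, 50)
y = ones(1000)

# Hypothetical helpers, named here only for the comparison.
copy_dot(y, Xt, i) = dot(y, Xt[:, i])        # slicing copies the column into a new array
view_dot(y, Xt, i) = dot(y, view(Xt, :, i))  # a view reuses Xt's memory

# Call each once first so compilation is not counted in the measurement.
copy_dot(y, Xt, 1)
view_dot(y, Xt, 1)

bytes_copy = @allocated copy_dot(y, Xt, 1)
bytes_view = @allocated view_dot(y, Xt, 1)

println("slice: ", bytes_copy, " bytes, view: ", bytes_view, " bytes")
```

Inside a loop over 10000 columns, the per-iteration copy adds up, which is why the demonstration above indexes `B[i,j]` directly rather than slicing.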

For comparison, when running on this same machine:

julia> blas_set_num_threads(8)

julia> @time A_mul_B!(Z, X, Y);
elapsed time: 0.067611858 seconds (80 bytes allocated)

So the raw Julia implementation is at least competitive with BLAS.
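
The code in this thread uses the API of an older Julia release. In Julia 1.x, `@parallel` became `@distributed`, `blas_set_num_threads` became `BLAS.set_num_threads`, and `A_mul_B!` became `mul!`. The following is a hedged sketch of how the same demonstration might look today, assuming the `Distributed` and `SharedArrays` standard libraries; the smaller problem size and the use of `permutedims` to build the transposed shared matrix are assumptions for illustration, and no timings are claimed for it.

```julia
using Distributed, SharedArrays, LinearAlgebra

# Define the kernel on every process so that workers can call it.
@everywhere function dotcol(a, B, j)
    length(a) == size(B, 1) || throw(DimensionMismatch("a and B must have the same number of rows"))
    s = 0.0
    @inbounds @simd for i = 1:length(a)
        s += a[i] * B[i, j]
    end
    return s
end

# Each worker fills its share of Z; @sync waits until all of them finish.
function runp!(Z, Y, Xt)
    @sync @distributed for j = 1:size(Xt, 2)
        Z[j] = dotcol(Y, Xt, j)
    end
    return Z
end

# Reduced dimensions so the sketch runs quickly (assumption).
n, p = 200, 300
x = randn(n, p)

# Store the transpose in shared memory so each dot product reads one contiguous column.
Xt = SharedArray(permutedims(x))
Y = SharedArray(ones(p))
Z = SharedArray(zeros(n))

runp!(Z, Y, Xt)
```

Started with `julia -p 8`, the loop is split across the eight workers; started without `-p`, it runs on the single master process and still produces the same `Z`.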
