Julia: why doesn't shared memory multi-threading give me a speedup?

Question

I want to use shared memory multi-threading in Julia. As the Threads.@threads macro does, I can use ccall(:jl_threading_run ...) to do this. My code now runs in parallel, but I don't get the speedup I expected.

The following code is intended as a minimal example of the approach I'm taking and the performance problem I'm having:

nthreads = Threads.nthreads()
test_size = 1000000
println("STARTED with ", nthreads, " thread(s) and test size of ", test_size, ".")
# Something to be processed:
objects = rand(test_size)
# Somewhere for our results
results = zeros(nthreads)
counts = zeros(nthreads)
# A function to do some work.
function worker_fn()
    work_idx = 1
    my_result = results[Threads.threadid()]
    while work_idx > 0
        my_result += objects[work_idx]
        work_idx += nthreads
        if work_idx > test_size
            break
        end
        counts[Threads.threadid()] += 1
    end
end

# Call our worker function using jl_threading_run
@time ccall(:jl_threading_run, Ref{Cvoid}, (Any,), worker_fn)

# Verify that we made as many calls as we think we did.
println("\nCOUNTS:")
println("\tPer thread:\t", counts)
println("\tSum:\t\t", sum(counts))

On an i7-7700, a typical single-threaded result is:

STARTED with 1 thread(s) and test size of 1000000.
 0.134606 seconds (5.00 M allocations: 76.563 MiB, 1.79% gc time)

COUNTS:
    Per thread:     [999999.0]
    Sum:            999999.0

With 4 threads:

STARTED with 4 thread(s) and test size of 1000000.
  0.140378 seconds (1.81 M allocations: 25.661 MiB)

COUNTS:
    Per thread:     [249999.0, 249999.0, 249999.0, 249999.0]
    Sum:            999996.0

Multi-threading slows things down! Why?

A better minimal example can be created with the @threads macro itself.

a = zeros(Threads.nthreads())
b = rand(test_size)
calls = zeros(Threads.nthreads())
@time Threads.@threads for i = 1 : test_size
    a[Threads.threadid()] += b[i]
    calls[Threads.threadid()] += 1
end

I falsely assumed that the @threads macro's inclusion in Julia would mean that there was a benefit to be had.

Recommended answer

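The problem you have is most probably false sharing: each thread writes to its own array element, but neighbouring elements sit on the same cache line, so the cores keep invalidating each other's copy of that line.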

You can solve it by spacing the areas you write to far enough apart, like this (a "quick and dirty" implementation to show the essence of the change):

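julia> using BenchmarkTools  # provides @btime

julia> function f(spacing)
           test_size = 1000000
           # One slot per thread, `spacing` elements apart, so that
           # threads write to separate regions of the arrays.
           a = zeros(Threads.nthreads()*spacing)
           b = rand(test_size)
           calls = zeros(Threads.nthreads()*spacing)
           Threads.@threads for i = 1 : test_size
               @inbounds begin
                   a[Threads.threadid()*spacing] += b[i]
                   calls[Threads.threadid()*spacing] += 1
               end
           end
           a, calls
       end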
f (generic function with 1 method)

julia> @btime f(1);
  41.525 ms (35 allocations: 7.63 MiB)

julia> @btime f(8);
  2.189 ms (35 allocations: 7.63 MiB)
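
The gap between f(1) and f(8) is consistent with eight Float64 values (8 bytes each) filling one 64-byte cache line: slots that are 64 bytes apart can never land on the same 64-byte line, whatever the alignment of the array. A minimal sketch of how such a spacing could be derived, assuming a 64-byte cache line (typical on x86-64, but not queried from the hardware here); `CACHE_LINE_BYTES`, `slots_per_line` and `padded_slot` are hypothetical names used only for illustration:

```julia
# Assumption: 64-byte cache lines (typical for x86-64; not read from the hardware).
const CACHE_LINE_BYTES = 64

# Number of elements of type T needed to span one cache line.
slots_per_line(::Type{T}) where {T} = cld(CACHE_LINE_BYTES, sizeof(T))

# Index of the slot owned by worker `k` when slots are `spacing` elements apart.
padded_slot(k, spacing) = (k - 1) * spacing + 1

spacing = slots_per_line(Float64)               # elements per line for Float64
acc = zeros(4 * spacing)                        # room for 4 workers
slots = [padded_slot(k, spacing) for k in 1:4]  # one slot index per worker
```

With a different element type (for example Float32) the same helper yields a larger spacing, since more elements fit on one line.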

Alternatively, do per-thread accumulation on a local variable, like this (this is the preferred approach, as it should be uniformly faster):

# Split 1:n into nt contiguous ranges and return the one for chunk `tid`.
function getrange(n, tid, nt = Threads.nthreads())
    d, r = divrem(n, nt)
    from = (tid - 1) * d + min(r, tid - 1) + 1
    to = from + d - 1 + (tid ≤ r ? 1 : 0)
    from:to
end

function f()
    test_size = 10^8
    a = zeros(Threads.nthreads())
    b = rand(test_size)
    calls = zeros(Threads.nthreads())
    Threads.@threads for k = 1 : Threads.nthreads()
        # Accumulate in locals so the hot loop never writes to shared arrays.
        local_a = 0.0
        local_c = 0.0
        # Index by k rather than Threads.threadid(), so each chunk is
        # processed exactly once however the iterations are scheduled.
        for i in getrange(test_size, k)
            for j in 1:10  # each element is added 10 times
                local_a += b[i]
                local_c += 1
            end
        end
        a[k] = local_a
        calls[k] = local_c
    end
    a, calls
end
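
The same idea of accumulating locally and combining at the end can also be written with one task per chunk, which avoids indexing shared arrays by thread id altogether. This is only a sketch and not part of the original answer: `chunked_sum` and its `ntasks` keyword are hypothetical names, and it assumes Julia 1.3 or later, where `Threads.@spawn` is available.

```julia
# Sketch: one task per chunk, each accumulating into its own local variables.
# `chunked_sum` is a hypothetical helper, not part of the original answer.
function chunked_sum(b::AbstractVector{Float64}; ntasks::Int = Threads.nthreads())
    # Chunk length chosen so that at most `ntasks` chunks are created.
    chunk_len = max(1, cld(length(b), ntasks))
    tasks = map(Iterators.partition(eachindex(b), chunk_len)) do idxs
        Threads.@spawn begin
            local_a = 0.0   # partial sum for this chunk
            local_c = 0     # number of elements visited in this chunk
            for i in idxs
                @inbounds local_a += b[i]
                local_c += 1
            end
            (local_a, local_c)
        end
    end
    partials = fetch.(tasks)  # wait for every task and collect its result
    total = mapreduce(first, +, partials; init = 0.0)
    count = mapreduce(last, +, partials; init = 0)
    total, count
end

b = collect(1.0:100.0)
total, count = chunked_sum(b)
```

Each task returns its partial result instead of writing into a shared array, so no two tasks ever write to the same cache line inside the hot loop.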

Also note that you may be running 4 threads on a machine with fewer physical cores than that (for example, 2 physical cores exposed as 4 virtual cores), in which case the gains from threading will not be linear.
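
As a quick check, the thread count Julia was started with can be compared against the number of logical CPUs it sees. A minimal sketch; note that `Sys.CPU_THREADS` counts logical CPUs (hardware threads), not physical cores, so it gives only an upper bound on the parallelism to expect:

```julia
# Sys.CPU_THREADS reports logical CPUs (hardware threads), which can exceed
# the number of physical cores when simultaneous multithreading is enabled.
nt = Threads.nthreads()
logical = Sys.CPU_THREADS
println("Julia threads: ", nt, ", logical CPUs: ", logical)
```

The number of Julia threads is fixed at startup, for example through the JULIA_NUM_THREADS environment variable.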
