Cython: make prange parallelization thread-safe


Question

Cython starter here. I am trying to speed up the calculation of a certain pairwise statistic (in several bins) by using multiple threads. In particular, I am using prange from cython.parallel, which internally uses OpenMP.

The following minimal example illustrates the problem (compilation via the Jupyter notebook Cython magic).

Notebook setup:

%load_ext Cython
import numpy as np

Cython code:

%%cython --compile-args=-fopenmp --link-args=-fopenmp -a

from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel

@boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):

    cdef: 
        int N = X.shape[0]
        int nbins = bins.shape[0]
        double Xij,Yij
        double[:] Z = np.zeros(nbins,dtype=np.float64)
        int i,j,b

    with nogil, parallel(num_threads=num_threads):
        for i in prange(N,schedule='static',chunksize=1):
            for j in range(i):
                #some pairwise quantities
                Xij = X[i]-X[j]
                Yij = 0.5*(X[i]+X[j])
                #check if in bin
                for b in range(nbins):
                    if (Xij < bins[b,0]) or (Xij > bins[b,1]):
                        continue
                    Z[b] += Xij*Yij

    return np.asarray(Z)

Mock data and bins:

X = np.random.rand(10000)
bin_edges = np.linspace(0.,1,11)
bins = np.array([bin_edges[:-1],bin_edges[1:]]).T
bins = bins.copy(order='C')

Timing this via

%timeit my_parallel_statistic(X,bins,1)
%timeit my_parallel_statistic(X,bins,4)

yields

1 loop, best of 3: 728 ms per loop
1 loop, best of 3: 330 ms per loop

which is not a perfect scaling, but that is not the main point of the question. (But do let me know if you have suggestions beyond adding the usual decorators or fine-tuning the prange arguments.)

However, this calculation is apparently not thread-safe:

Z1 = my_parallel_statistic(X,bins,1)
Z4 = my_parallel_statistic(X,bins,4)
np.allclose(Z1,Z4)

reveals a significant difference between the two results (up to 20% in this example).

I strongly suspect that the problem is that multiple threads can do

Z[b] += Xij*Yij

at the same time. But what I don't know is how to fix this without sacrificing the speed-up.

In my actual use case, the calculation of Xij and Yij is more expensive, hence I would like to do them only once per pair. Also, pre-computing and storing Xij and Yij for all pairs and then simply looping through bins is not a good option either, because N can get very large and I can't store 100,000 x 100,000 numpy arrays in memory (this was actually the main motivation for rewriting it in Cython!).

System info (added following a suggestion in the comments):

CPU(s): 8
Model name: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
OS: Red Hat Linux v6.8
Memory: 16 GB

Answer

Yes, Z[b] += Xij*Yij is indeed a race condition.

There are a couple of options for making this atomic or critical. Implementation issues with Cython aside, you would in any case get poor performance due to false sharing on the shared Z vector.
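
For illustration only (a sketch of the critical-section idea, not code from the answer): guarding the update by reacquiring the GIL makes the accumulation correct, but it serializes the hot inner loop, so the parallel version gains little or nothing over the serial one.

%%cython --compile-args=-fopenmp --link-args=-fopenmp

# Sketch of the "critical section" option: correct results, poor performance,
# because every bin update reacquires the GIL.
from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel

@boundscheck(False)
def my_parallel_statistic_gil(double[:] X, double[:,::1] bins, int num_threads):

    cdef:
        int N = X.shape[0]
        int nbins = bins.shape[0]
        double Xij,Yij
        double[:] Z = np.zeros(nbins,dtype=np.float64)
        int i,j,b

    with nogil, parallel(num_threads=num_threads):
        for i in prange(N,schedule='static',chunksize=1):
            for j in range(i):
                Xij = X[i]-X[j]
                Yij = 0.5*(X[i]+X[j])
                for b in range(nbins):
                    if (Xij < bins[b,0]) or (Xij > bins[b,1]):
                        continue
                    with gil:  # only one thread at a time may update Z
                        Z[b] += Xij*Yij

    return np.asarray(Z)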

So the better alternative is to reserve a private array for each thread. There are a couple of (non-)options again. One could use a private malloc'd pointer, but I wanted to stick with np. Memory slices cannot be assigned as private variables. A two-dimensional (num_threads, nbins) array works, but for some reason generates very complicated and inefficient array indexing code; that approach works but is slower and does not scale.

A flat numpy array with manual "2D" indexing works well. You get a little extra performance by padding each thread's private part of the array to a multiple of 64 bytes, a typical cache-line size; this avoids false sharing between the cores. The private parts are simply summed up serially outside of the parallel region.

%%cython --compile-args=-fopenmp --link-args=-fopenmp -a
from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
cimport openmp

@boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):

    cdef: 
        int N = X.shape[0]
        int nbins = bins.shape[0]
        double Xij,Yij
        # pad local data to a multiple of 64 bytes (a typical cache line size) to avoid false sharing
        int nbins_padded = (((nbins - 1) // 8) + 1) * 8
        double[:] Z_local = np.zeros(nbins_padded * num_threads,dtype=np.float64)
        double[:] Z = np.zeros(nbins)
        int i,j,b, bb, tid

    with nogil, parallel(num_threads=num_threads):
        tid = openmp.omp_get_thread_num()
        for i in prange(N,schedule='static',chunksize=1):
            for j in range(i):
                #some pairwise quantities
                Xij = X[i]-X[j]
                Yij = 0.5*(X[i]+X[j])
                #check if in bin
                for b in range(nbins):
                    if (Xij < bins[b,0]) or (Xij > bins[b,1]):
                        continue
                    Z_local[tid * nbins_padded + b] += Xij*Yij
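    # serially reduce the per-thread partial sums into Z (outside the parallel region)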
    for tid in range(num_threads):
        for bb in range(nbins):
            Z[bb] += Z_local[tid * nbins_padded + bb]


    return np.asarray(Z)

This performs quite well on my 4-core machine, with 720 ms / 191 ms, a speedup of 3.6. The remaining gap may be due to turbo mode. I don't have access to a proper machine for testing right now.
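
As a quick sanity check (not part of the original answer), the comparison from the question can be rerun against the fixed function; with per-thread buffers the 1-thread and 4-thread results should now agree:

Z1 = my_parallel_statistic(X,bins,1)
Z4 = my_parallel_statistic(X,bins,4)
np.allclose(Z1,Z4)   # expected True once each thread writes only its own buffer

%timeit my_parallel_statistic(X,bins,1)
%timeit my_parallel_statistic(X,bins,4)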

