Cython: make prange parallelization thread-safe
Question
Cython starter here. I am trying to speed up a calculation of a certain pairwise statistic (in several bins) by using multiple threads. In particular, I am using prange from cython.parallel, which internally uses openMP.
The following minimal example illustrates the problem (compilation via Jupyter notebook Cython magic).
Notebook setup:
%load_ext Cython
import numpy as np
Cython code:
%%cython --compile-args=-fopenmp --link-args=-fopenmp -a

from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel

@boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):

    cdef:
        int N = X.shape[0]
        int nbins = bins.shape[0]
        double Xij,Yij
        double[:] Z = np.zeros(nbins,dtype=np.float64)
        int i,j,b

    with nogil, parallel(num_threads=num_threads):
        for i in prange(N,schedule='static',chunksize=1):
            for j in range(i):
                #some pairwise quantities
                Xij = X[i]-X[j]
                Yij = 0.5*(X[i]+X[j])
                #check if in bin
                for b in range(nbins):
                    if (Xij < bins[b,0]) or (Xij > bins[b,1]):
                        continue
                    Z[b] += Xij*Yij

    return np.asarray(Z)
Mock data and bins:
X = np.random.rand(10000)
bin_edges = np.linspace(0.,1,11)
bins = np.array([bin_edges[:-1],bin_edges[1:]]).T
bins = bins.copy(order='C')
Timing via
%timeit my_parallel_statistic(X,bins,1)
%timeit my_parallel_statistic(X,bins,4)
yields
1 loop, best of 3: 728 ms per loop
1 loop, best of 3: 330 ms per loop
which is not a perfect scaling, but that is not the main point of the question. (But do let me know if you have suggestions beyond adding the usual decorators or fine-tuning the prange arguments.)
However, this calculation is apparently not thread-safe:
Z1 = my_parallel_statistic(X,bins,1)
Z4 = my_parallel_statistic(X,bins,4)
np.allclose(Z1,Z4)
reveals a significant difference between the two results (up to 20% in this example).
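To see which result is trustworthy, the statistic can be recomputed with a slow pure-Python reference and compared against the serial run. This is a sketch, not part of the original post: the function name pairwise_statistic_ref is invented here, and the triple loop is only practical for small N.

```python
import numpy as np

# Hypothetical reference implementation (name invented for illustration):
# same pairwise statistic as the Cython version, in pure Python, small N only.
def pairwise_statistic_ref(X, bins):
    N = X.shape[0]
    nbins = bins.shape[0]
    Z = np.zeros(nbins)
    for i in range(N):
        for j in range(i):
            Xij = X[i] - X[j]          # pairwise difference
            Yij = 0.5 * (X[i] + X[j])  # pairwise mean
            for b in range(nbins):
                # accumulate when Xij falls inside bin b
                if bins[b, 0] <= Xij <= bins[b, 1]:
                    Z[b] += Xij * Yij
    return Z
```

Comparing this against the num_threads=1 result with np.allclose on a small X would confirm that the serial Cython output is correct and the multithreaded one is not.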
I strongly suspect that the problem is that multiple threads can do
Z[b] += Xij*Yij
at the same time. But what I don't know is how to fix this without sacrificing the speed-up.
In my actual use case, the calculation of Xij and Yij is more expensive, hence I would like to do them only once per pair. Also, pre-computing and storing Xij and Yij for all pairs and then simply looping through bins is not a good option either because N can get very large, and I can't store 100,000 x 100,000 numpy arrays in memory (this was actually the main motivation for rewriting it in Cython!).
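A rough back-of-the-envelope check of that memory claim (plain arithmetic, not from the post; variable names are just for illustration):

```python
# Storing all N*N pairwise Xij values as float64 for N = 100,000:
N = 100_000
bytes_per_double = 8
total_gb = N * N * bytes_per_double / 1e9  # per pairwise array
```

That is about 80 GB for Xij alone (and the same again for Yij), far beyond the 16 GB of the machine described below, so precomputing all pairs is indeed not an option.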
System info (added following suggestion in comments):
CPU(s): 8
Model name: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
OS: Red Hat Linux v6.8
Memory: 16 GB
Answer
Yes,
Z[b] += Xij*Yij
is indeed a race condition.
There are a couple of options for making this atomic or critical. Implementation issues with Cython aside, you would in any case have bad performance due to false sharing on the shared Z vector.
So the better alternative is to reserve a private array for each thread. There are a couple of (non-)options again. One could use a private malloc'd pointer, but I wanted to stick with np. Memory slices cannot be assigned as private variables. A two-dimensional (num_threads, nbins) array works, but for some reason generates very complicated, inefficient array indexing code. This approach works but is slower and does not scale.
A flat numpy array with manual "2D" indexing works well. You get a little extra performance by padding each thread's private part of the array to 64 bytes, the typical cache-line size. This avoids false sharing between the cores. The private parts are simply summed up serially outside of the parallel region.
%%cython --compile-args=-fopenmp --link-args=-fopenmp -a

from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
cimport openmp

@boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):

    cdef:
        int N = X.shape[0]
        int nbins = bins.shape[0]
        double Xij,Yij
        # pad local data to 64 bytes to avoid false sharing of cache lines
        int nbins_padded = (((nbins - 1) // 8) + 1) * 8
        double[:] Z_local = np.zeros(nbins_padded * num_threads,dtype=np.float64)
        double[:] Z = np.zeros(nbins)
        int i,j,b, bb, tid

    with nogil, parallel(num_threads=num_threads):
        tid = openmp.omp_get_thread_num()
        for i in prange(N,schedule='static',chunksize=1):
            for j in range(i):
                #some pairwise quantities
                Xij = X[i]-X[j]
                Yij = 0.5*(X[i]+X[j])
                #check if in bin
                for b in range(nbins):
                    if (Xij < bins[b,0]) or (Xij > bins[b,1]):
                        continue
                    Z_local[tid * nbins_padded + b] += Xij*Yij

    for tid in range(num_threads):
        for bb in range(nbins):
            Z[bb] += Z_local[tid * nbins_padded + bb]

    return np.asarray(Z)
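The padding arithmetic and the flat "2D" indexing/reduction pattern can be illustrated in plain NumPy, without threads. This is only a sketch; the function name pad_and_reduce is invented here and the per-thread partial histograms are passed in directly instead of being computed in parallel.

```python
import numpy as np

# Illustration of the answer's layout: each thread owns a padded stride of a
# flat array, and the strides are summed serially at the end.
def pad_and_reduce(partials):
    # partials: (num_threads, nbins) array of per-thread partial histograms
    num_threads, nbins = partials.shape
    # round nbins up to a multiple of 8 doubles = 64 bytes (one cache line)
    nbins_padded = (((nbins - 1) // 8) + 1) * 8
    Z_local = np.zeros(nbins_padded * num_threads)
    for tid in range(num_threads):
        for b in range(nbins):
            # each "thread" writes only inside its own padded stride
            Z_local[tid * nbins_padded + b] = partials[tid, b]
    # serial reduction, as done outside the parallel region above
    Z = np.zeros(nbins)
    for tid in range(num_threads):
        for b in range(nbins):
            Z[b] += Z_local[tid * nbins_padded + b]
    return nbins_padded, Z
```

With the 10 bins from the question, nbins_padded becomes 16, so consecutive threads' strides start 128 bytes apart and never share a cache line.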
This performs quite well on my 4-core machine: 720 ms / 191 ms, a speedup of 3.6x. The remaining gap may be due to turbo mode. I don't have access to a proper machine for testing right now.