Cython nogil with ThreadPoolExecutor not giving speedups
Question
I was under the assumption that if I wrote my code in Cython using the nogil directive, it would bypass the GIL and I could use a ThreadPoolExecutor to use multiple cores. Or, more likely, I messed up something in the implementation, but I can't figure out what.
I've written a simple n-body simulation using the Barnes-Hut algorithm, and would like to do the force lookups in parallel:
# cython: boundscheck=False
# cython: wraparound=False
...
def estimate_forces(self, query_point):
    ...
    cdef np.float64_t[:, :] forces
    forces = np.zeros_like(query_point, dtype=np.float64)
    estimate_forces_multiple(self.root, query_point.data, forces, self.theta)
    return np.array(forces, dtype=np.float64)

cdef void estimate_forces_multiple(...) nogil:
    for i in range(len(query_points)):
        ...
        estimate_forces(cell, query_point, forces, theta)
I call the code like this:
data = np.random.uniform(0, 100, (1000000, 2))
executor = ThreadPoolExecutor(max_workers=max_workers)
quad_tree = QuadTree(data)
chunks = np.array_split(data, max_workers)
forces = executor.map(quad_tree.estimate_forces, chunks)
forces = np.vstack(list(forces))
I've omitted a lot of code to make the part in question clearer. My understanding is that increasing max_workers should use multiple cores and provide a substantial speedup. However, this does not seem to be the case:
> time python barnes_hut.py --max-workers 1
python barnes_hut.py 9.35s user 0.61s system 106% cpu 9.332 total
> time python barnes_hut.py --max-workers 2
python barnes_hut.py 9.05s user 0.64s system 107% cpu 9.048 total
> time python barnes_hut.py --max-workers 4
python barnes_hut.py 9.08s user 0.64s system 107% cpu 9.035 total
> time python barnes_hut.py --max-workers 8
python barnes_hut.py 9.12s user 0.71s system 108% cpu 9.098 total
Building the quad tree takes less than 1s, so the majority of the time is spent in estimate_forces_multiple, but clearly I get no speedup from using multiple threads. Looking at top, it doesn't appear to use multiple cores either.
My guess is that I must have missed something quite crucial, but I can't figure out what.
Recommended answer
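I was missing a crucial part: the with nogil block that actually signals Cython to release the GIL. Declaring a function nogil only states that it can be called without the GIL; it does not release the GIL by itself.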
def estimate_forces(self, query_point):
    ...
    cdef np.float64_t[:, :] forces
    forces = np.zeros_like(query_point, dtype=np.float64)
    # HERE
    cdef DTYPE_t[:, :] query_points = query_point.data
    with nogil:
        estimate_forces_multiple(self.root, query_points, forces, self.theta)
    return np.array(forces, dtype=np.float64)
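As a rough way to see the same effect without compiling any Cython, the sketch below runs two workloads through a ThreadPoolExecutor: a pure-Python loop, which holds the GIL, and hashlib.sha256 on a large buffer, which releases the GIL while hashing. The helper names (holds_gil, releases_gil, timed_map) and the workload sizes are made up for illustration and are not part of the original code. On a standard CPython build, only the second workload should get noticeably faster with more workers.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor


def holds_gil(n):
    # Pure-Python loop: the interpreter holds the GIL while running bytecode,
    # so threads executing this take turns instead of running in parallel.
    total = 0
    for i in range(n):
        total += i * i
    return total


def releases_gil(buf):
    # hashlib releases the GIL while hashing buffers larger than 2047 bytes,
    # so several threads can hash at the same time on separate cores.
    return hashlib.sha256(buf).hexdigest()


def timed_map(func, args, max_workers):
    # Returns the results and the elapsed wall-clock time in seconds.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(func, args))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    loops = [2_000_000] * 4                  # illustrative sizes only
    buffers = [bytes(32 * 1024 * 1024)] * 4  # four 32 MiB zero-filled buffers
    for workers in (1, 4):
        _, t_loop = timed_map(holds_gil, loops, workers)
        _, t_hash = timed_map(releases_gil, buffers, workers)
        print(f"max_workers={workers}: pure Python {t_loop:.2f}s, hashlib {t_hash:.2f}s")
```

If a Cython function shows the pure-Python pattern here (no change as max_workers grows), that suggests the call is not wrapped in a with nogil block.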
I also found that the UNIX time command doesn't do what I wanted for multithreaded programs: it reported the same numbers (I guess it reported the CPU time?). Timing with Python's timeit gave the expected results:
max_workers=1: 91.2366s
max_workers=2: 36.7975s
max_workers=4: 30.1390s
max_workers=8: 24.0240s
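To tell elapsed time and CPU time apart from inside the program, one option is to record both around the call. This is a minimal sketch, not what I originally ran: measure is a hypothetical helper, time.perf_counter gives wall-clock time, and time.process_time gives CPU time summed over all threads of the process. When threads really run in parallel, the CPU time can exceed the wall-clock time.

```python
import time


def measure(func, *args):
    # Record wall-clock and process-wide CPU time around a single call.
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, wall, cpu


if __name__ == "__main__":
    # Sleeping takes wall-clock time but almost no CPU time.
    _, wall, cpu = measure(time.sleep, 0.2)
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

For the speedup comparison, the wall-clock number is the one to watch; the CPU total should stay roughly constant or grow slightly as workers are added.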