Cython nogil with ThreadPoolExecutor not giving speedups


Question

I was under the assumption that if I wrote my code in Cython using the nogil directive, it would bypass the GIL and I could use a ThreadPoolExecutor to use multiple cores. More likely, I messed something up in the implementation, but I can't figure out what.

I've written a simple n-body simulation using the Barnes-Hut algorithm, and would like to do the lookups in parallel:

# cython: boundscheck=False
# cython: wraparound=False
...

def estimate_forces(self, query_point):
    ...
    cdef np.float64_t[:, :] forces

    forces = np.zeros_like(query_point, dtype=np.float64)
    estimate_forces_multiple(self.root, query_point.data, forces, self.theta)

    return np.array(forces, dtype=np.float64)


cdef void estimate_forces_multiple(...) nogil:
    for i in range(len(query_points)):
        ...
        estimate_forces(cell, query_point, forces, theta)

I call the code like this:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

data = np.random.uniform(0, 100, (1000000, 2))

executor = ThreadPoolExecutor(max_workers=max_workers)

quad_tree = QuadTree(data)

chunks = np.array_split(data, max_workers)
forces = executor.map(quad_tree.estimate_forces, chunks)
forces = np.vstack(list(forces))
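As a side note on this calling pattern: executor.map yields results in the order of its inputs, not in completion order, so stacking the per-chunk results puts the rows back in their original order. The following is a minimal sketch using only the standard library. split_chunks and fake_forces are hypothetical stand-ins (mimicking np.array_split and quad_tree.estimate_forces respectively), not the code from the question:

```python
from concurrent.futures import ThreadPoolExecutor


def split_chunks(items, n_chunks):
    """Split a list into n_chunks nearly equal parts; the first chunks get the extras."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def fake_forces(chunk):
    """Hypothetical stand-in for estimate_forces: one output row per input point."""
    return [(-x, -y) for x, y in chunk]


def run_chunked(points, max_workers):
    """Process the points chunk by chunk on a thread pool and reassemble the rows."""
    chunks = split_chunks(points, max_workers)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() returns results in input order, so flattening preserves row order
        results = executor.map(fake_forces, chunks)
        return [row for part in results for row in part]


if __name__ == "__main__":
    points = [(float(i), float(2 * i)) for i in range(10)]
    # The chunked result matches a single unchunked call
    print(run_chunked(points, 4) == fake_forces(points))
```

This only illustrates ordering and reassembly. Whether the chunks actually run in parallel depends on the worker function releasing the GIL.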

I've omitted a lot of code to make the relevant parts clearer. My understanding is that increasing max_workers should use multiple cores and provide a substantial speedup. However, this does not seem to be the case:

> time python barnes_hut.py --max-workers 1
python barnes_hut.py  9.35s user 0.61s system 106% cpu 9.332 total

> time python barnes_hut.py --max-workers 2
python barnes_hut.py  9.05s user 0.64s system 107% cpu 9.048 total

> time python barnes_hut.py --max-workers 4
python barnes_hut.py  9.08s user 0.64s system 107% cpu 9.035 total

> time python barnes_hut.py --max-workers 8
python barnes_hut.py  9.12s user 0.71s system 108% cpu 9.098 total

Building the quad tree takes less than 1s, so the majority of the time is spent in estimate_forces_multiple, but clearly I get no speedup from using multiple threads. Looking at top, it doesn't appear to use multiple cores either.

My guess is that I've missed something crucial, but I can't figure out what.

Recommended answer

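I was missing the crucial part that actually signals that the GIL should be released. Marking a cdef function as nogil only declares that it is safe to call without the GIL; the GIL is released only inside a with nogil block: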

def estimate_forces(self, query_point):
    ...
    cdef np.float64_t[:, :] forces

    forces = np.zeros_like(query_point, dtype=np.float64)
    # HERE: bind the buffer to a typed memoryview, then release the GIL for the call
    cdef DTYPE_t[:, :] query_points = query_point.data
    with nogil:
        estimate_forces_multiple(self.root, query_points, forces, self.theta)

    return np.array(forces, dtype=np.float64)
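The same effect can be observed from plain Python without compiling anything. This is a minimal sketch, assuming a standard (GIL-enabled) CPython build: holds_gil is a pure-Python loop that keeps the GIL, while releases_gil relies on hashlib, which releases the GIL while hashing large buffers. The function names, buffer size and iteration counts are arbitrary choices for illustration, and the timings depend on the machine:

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

# 4 MiB buffer: large enough for hashlib to release the GIL while hashing it
PAYLOAD = b"x" * (4 * 1024 * 1024)


def holds_gil(n):
    """Pure-Python loop: the interpreter keeps the GIL while this runs."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def releases_gil(rounds):
    """Hash the payload repeatedly; hashlib drops the GIL during each update."""
    digest = hashlib.sha256()
    for _ in range(rounds):
        digest.update(PAYLOAD)
    return digest.hexdigest()


def wall_time(fn, arg, tasks, max_workers):
    """Run `tasks` calls of fn(arg) on a thread pool; return elapsed wall-clock seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(fn, [arg] * tasks))
    return time.perf_counter() - start


if __name__ == "__main__":
    for workers in (1, 4):
        gil_s = wall_time(holds_gil, 500_000, 4, workers)
        nogil_s = wall_time(releases_gil, 4, 4, workers)
        print(f"{workers} workers: holds_gil {gil_s:.2f}s, releases_gil {nogil_s:.2f}s")
```

On a multi-core machine, only the releases_gil timing would be expected to drop as workers are added, which mirrors the difference between calling the Cython function with and without the with nogil block.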

I've also found that the UNIX time command doesn't do what I wanted for multithreaded programs and reported the same numbers (I guess it reported the CPU time?). Using Python's timeit gave the expected results:

max_workers=1: 91.2366s
max_workers=2: 36.7975s
max_workers=4: 30.1390s
max_workers=8: 24.0240s
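For reference, timeit uses time.perf_counter by default, which is a wall-clock timer, and it accepts a different timer through its timer argument. The sketch below measures the same thread-pool call once with the default timer and once with time.process_time, which reports CPU time for the whole process. The run_sleepers workload is a hypothetical stand-in: time.sleep releases the GIL but uses almost no CPU, so it shows how the two measurements can diverge rather than reproducing the numbers above:

```python
import time
import timeit
from concurrent.futures import ThreadPoolExecutor


def run_sleepers(max_workers, tasks=4, delay=0.05):
    """Hypothetical workload: each task sleeps, which releases the GIL."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(time.sleep, [delay] * tasks))


def wall_and_cpu(max_workers):
    """Return (wall-clock seconds, CPU seconds) for one run of the workload."""
    # Default timer is time.perf_counter: elapsed wall-clock time
    wall = timeit.timeit(lambda: run_sleepers(max_workers), number=1)
    # time.process_time: user + system CPU time of the process, all threads
    cpu = timeit.timeit(lambda: run_sleepers(max_workers), number=1,
                        timer=time.process_time)
    return wall, cpu


if __name__ == "__main__":
    for workers in (1, 4):
        wall, cpu = wall_and_cpu(workers)
        print(f"max_workers={workers}: wall {wall:.3f}s, cpu {cpu:.3f}s")
```

When checking for a threading speedup, the wall-clock figure is the one to compare across max_workers values, since CPU time is summed over all threads.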
