Usage of ThreadPoolExecutor in conjunction with Cython's nogil
Question
I have read this question and answer, "Cython nogil with ThreadPoolExecutor not giving speedups", and I have a similar problem: my Cython code does not get the expected speedup even though my system has multiple cores. I have 4 physical cores on an Ubuntu 18.04 instance, and if I set the number of jobs to 1 in the code below, it runs faster than when I set it to 4. Looking at the CPU usage in top, I see it go up to 300%. I am doing lookups in a data structure in a C++ class that does not get modified, i.e. I am only doing read-only queries on the C++ data structure via Cython. There are no mutex locks whatsoever on the C++ side.
This is my first experience with the GIL and I am wondering whether I have used it incorrectly. The timing output is also a bit confusing, as I do not think it correctly reflects the actual time taken by each of the worker threads.
I appear to have missed something crucial, but I cannot figure out what it is, as I have used pretty much the same template for handling the GIL as in the linked SO answer.
import time
import psutil
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from functools import partial

cdef extern from "Rectangle.h" namespace "shapes":
    cdef cppclass Rectangle:
        Rectangle(int, int, int, int)
        int x0, y0, x1, y1
        int getArea() nogil

cdef class PyRectangle:
    cdef Rectangle *rect

    def __cinit__(self, int x0, int y0, int x1, int y1):
        self.rect = new Rectangle(x0, y0, x1, y1)

    def __dealloc__(self):
        del self.rect

    def testThread(self):
        # minLat, maxLat, minLon, maxLon and maxDistance are defined elsewhere (not shown)
        latGrid = np.arange(minLat,maxLat,0.05)
        lonGrid = np.arange(minLon,maxLon,0.05)
        gridLon,gridLat = np.meshgrid(latGrid,lonGrid)
        grid_points = np.c_[gridLon.ravel(),gridLat.ravel()]
        # one chunk of grid points per physical core
        n_jobs = psutil.cpu_count(logical=False)
        chunk = np.array_split(grid_points,n_jobs,axis=0)
        x = ThreadPoolExecutor(max_workers=n_jobs)
        t0 = time.time()
        func = partial(self.performCalc,maxDistance)
        results = x.map(func,chunk)
        results = np.vstack(list(results))
        t1 = time.time()
        print(t1-t0)

    def performCalc(self,maxDistance,chunk):
        cdef int area
        cdef double[:,:] gPoints
        gPoints = memoryview(chunk)
        for i in range(0,len(gPoints)):
            # the GIL is released and re-acquired on every iteration
            with nogil:
                area = self.getArea2(gPoints[i])
        return area

    cdef int getArea2(self,double[:] p) nogil :
        cdef int area
        area = self.rect.getArea()
        return area
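On the timing concern raised in the question: a single time.time() pair around the whole map call only gives wall-clock time for the entire batch. A minimal pure-Python sketch of per-worker timing follows. The timed wrapper and the fake_calc stand-in are hypothetical names introduced here for illustration; they are not part of the original code, and fake_calc merely takes the place of performCalc.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def timed(func):
    """Wrap func so each call also reports its own elapsed time and thread name."""
    def wrapper(chunk):
        start = time.perf_counter()
        result = func(chunk)
        elapsed = time.perf_counter() - start
        return result, elapsed, threading.current_thread().name
    return wrapper


def fake_calc(chunk):
    # Hypothetical stand-in for performCalc: any per-chunk computation works here.
    return sum(chunk)


# Four chunks of 1000 integers each, one per worker.
chunks = [list(range(start, start + 1000)) for start in range(0, 4000, 1000)]

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    timed_results = list(pool.map(timed(fake_calc), chunks))
wall = time.perf_counter() - t0

for result, elapsed, name in timed_results:
    print(f"{name}: result={result} elapsed={elapsed:.6f}s")
print(f"wall-clock for the whole batch: {wall:.6f}s")
```

Comparing the per-worker times with the wall-clock total gives a rough idea of whether the workers actually overlapped or mostly ran one after another.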
Recommended answer
My suggestion (in the comments) was to ensure that the entire performCalc loop is nogil. To do this, a few changes were needed:
cdef Py_ssize_t i  # set type of "i" (although Cython can possibly deduce this anyway)
with nogil:
    for i in range(0,gPoints.shape[0]):
        area = self.getArea2(gPoints[i])
The most important of these is swapping len(gPoints) for gPoints.shape[0], which replaces a call to a Python function with an array lookup (I personally also don't think len makes sense for a 2D array).
Essentially, there is a cost to acquiring and releasing the GIL, and you want to make sure that the work done without the GIL is worth the time spent handling it. Calculating the area of a rectangle is pretty trivial (two subtractions and a multiplication), so it doesn't really justify the time spent coordinating the GIL between threads. Remember that once per loop iteration each thread must (briefly) hold the GIL, during which time no other thread can hold it. With the whole loop as nogil, however, the time spent administering it becomes tiny.
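The difference in coordination cost can be illustrated with a rough pure-Python analogy. This is only a sketch: CountingLock is a hypothetical stand-in for the GIL built on threading.Lock, not how CPython implements it, and the functions per_iteration, whole_loop and run are names invented here. It counts how many times the shared lock has to be taken when it is released around each iteration versus once around the whole loop.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CountingLock:
    """Hypothetical stand-in for the GIL that counts how often it is acquired."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def acquire(self):
        self._lock.acquire()
        self.acquisitions += 1  # safe: only updated while the lock is held

    def release(self):
        self._lock.release()


def area(rect):
    x0, y0, x1, y1 = rect
    return (x1 - x0) * (y1 - y0)  # two subtractions and a multiplication


def per_iteration(gil, rects):
    """Mimics `with nogil:` placed inside the loop body."""
    gil.acquire()  # the thread starts out holding the "GIL"
    total = 0
    for rect in rects:
        gil.release()  # entering the nogil block
        a = area(rect)
        gil.acquire()  # leaving it: must win the lock back every iteration
        total += a
    gil.release()
    return total


def whole_loop(gil, rects):
    """Mimics `with nogil:` wrapped around the entire loop."""
    gil.acquire()
    gil.release()  # entering the nogil block once
    total = 0
    for rect in rects:
        total += area(rect)
    gil.acquire()  # leaving it once
    gil.release()
    return total


def run(worker, rects, n_jobs=4):
    gil = CountingLock()
    chunks = [rects[i::n_jobs] for i in range(n_jobs)]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        totals = list(pool.map(lambda chunk: worker(gil, chunk), chunks))
    return sum(totals), gil.acquisitions


rects = [(0, 0, i + 1, 2) for i in range(1000)]
print(run(per_iteration, rects))  # same total, one acquisition per rectangle plus one per chunk
print(run(whole_loop, rects))     # same total, two acquisitions per chunk
```

Both variants compute the same total, but the per-iteration version takes the shared lock roughly once per rectangle, while the whole-loop version takes it a fixed number of times per chunk regardless of chunk size.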