Usage of ThreadPoolExecutor in conjunction with Cython's nogil


Question

I have read this question and answer, "Cython nogil with ThreadPoolExecutor not giving speedups", and I have a similar problem: my Cython code is not getting the expected speedup even though my system has multiple cores. I have 4 physical cores on an Ubuntu 18.04 instance, and if I set the number of jobs to 1 in the code below, it runs faster than when I set it to 4. Looking at the CPU usage with top, I see it go up to 300%. I am doing lookups in a data structure inside a C++ class that does not get modified, i.e. I am only doing read-only queries on the C++ data structure via Cython. There are no mutex locks whatsoever on the C++ side.

This is my first experience with the GIL and I am wondering whether I have used it incorrectly. The timing output is also a bit confusing, as I do not think it correctly profiles the actual time taken by each of the worker threads.

I appear to have missed something crucial, but I cannot figure out what it is, as I have pretty much used the same template for the usage of the GIL as seen in the linked SO answer.

import time

import psutil
import numpy as np

from concurrent.futures import ThreadPoolExecutor
from functools import partial



cdef extern from "Rectangle.h" namespace "shapes":
    cdef cppclass Rectangle:
        Rectangle(int, int, int, int)
        int x0, y0, x1, y1
        int getArea() nogil


cdef class PyRectangle:
    cdef Rectangle *rect

    def __cinit__(self, int x0, int y0, int x1, int y1):
        self.rect = new Rectangle(x0, y0, x1, y1)

    def __dealloc__(self):
        del self.rect

    def testThread(self):
        # minLat, maxLat, minLon, maxLon and maxDistance are not defined in the posted snippet

        latGrid = np.arange(minLat,maxLat,0.05)
        lonGrid = np.arange(minLon,maxLon,0.05)

        gridLon,gridLat = np.meshgrid(latGrid,lonGrid)
        grid_points = np.c_[gridLon.ravel(),gridLat.ravel()]

        # one worker thread (and one chunk of points) per physical core
        n_jobs = psutil.cpu_count(logical=False)

        chunk = np.array_split(grid_points,n_jobs,axis=0)
        x = ThreadPoolExecutor(max_workers=n_jobs)

        t0 = time.time()
        func = partial(self.performCalc,maxDistance)
        results = x.map(func,chunk)
        results = np.vstack(list(results))
        t1 = time.time()
        print(t1-t0)

    def performCalc(self,maxDistance,chunk):

        cdef int area
        cdef double[:,:] gPoints
        gPoints = memoryview(chunk)
        for i in range(0,len(gPoints)):
            # the GIL is released and re-acquired on every iteration
            with nogil:
                area =  self.getArea2(gPoints[i])
        return area

    cdef int getArea2(self,double[:] p) nogil :
        cdef int area
        area = self.rect.getArea()
        return area

Recommended answer

My suggestion (in the comments) was to ensure that the entire performCalc loop was nogil. To do this, a few changes were needed:

cdef Py_ssize_t i # set type of "i" (although Cython can possibly deduce this anyway)
with nogil:
    for i in range(0,gPoints.shape[0]):
        area =  self.getArea2(gPoints[i])

The most important of these changes is swapping len(gPoints) for gPoints.shape[0], which replaces a call to a Python function with an array lookup (I personally also don't think len makes sense for a 2D array).
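As a quick sanity check that the swap does not change the loop bounds, here is a minimal sketch in plain Python. It uses the built-in memoryview rather than a Cython typed memoryview, and the names flat and g_points are made up for illustration, so it only shows that both expressions report the same row count for a 2D buffer. It says nothing about how Cython compiles either one.

```python
from array import array

# Hypothetical stand-in for one chunk of grid points: 6 rows of 2 doubles.
flat = array("d", [0.0] * 12)

# memoryview.cast needs a byte format on one side of each cast,
# hence the intermediate cast to "B" before reshaping to 2D.
g_points = memoryview(flat).cast("B").cast("d", shape=[6, 2])

n_rows_len = len(g_points)          # length of the first dimension
n_rows_shape = g_points.shape[0]    # same number, read from the shape tuple

print(n_rows_len, n_rows_shape)
```

Since the two values agree, the range of the loop is unchanged by the edit; only the way the bound is obtained differs.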

Essentially, there is a cost to acquiring and releasing the GIL, and you want to make sure that the work done without the GIL is worth the time spent handling it. Simply calculating the area of a rectangle is pretty trivial (two subtractions and a multiplication), so it doesn't really justify the time spent coordinating the GIL between threads. Remember that once per loop iteration each thread must (briefly) hold the GIL, during which time no other thread can hold it. With the whole loop inside the nogil block, however, the time spent on administering the GIL becomes tiny.
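The trade-off described above can be illustrated with a rough analogy in plain Python. This is only a sketch: an ordinary threading.Lock stands in for the GIL, and the function names (area, per_iteration, whole_loop, run) are invented for the example. Because the threads here still run under the real GIL, the timings it prints are indicative at best. What it does show is the difference in how often the shared lock changes hands.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the GIL (an analogy only, not the real interpreter lock).
lock = threading.Lock()

def area(x0, y0, x1, y1):
    """Trivial work: two subtractions and a multiplication."""
    return (x1 - x0) * (y1 - y0)

def per_iteration(n):
    """Touch the shared lock once per item, like `with nogil` inside the loop."""
    total = 0
    for _ in range(n):
        with lock:          # every iteration contends for the shared lock
            pass
        total += area(0, 0, 3, 2)
    return total

def whole_loop(n):
    """Touch the shared lock once per chunk, like wrapping the whole loop."""
    with lock:              # a single handoff for the entire chunk
        pass
    total = 0
    for _ in range(n):
        total += area(0, 0, 3, 2)
    return total

def run(worker, n_items, n_threads):
    """Run `worker` on n_threads threads; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(worker, [n_items] * n_threads))
    return results, time.perf_counter() - start

if __name__ == "__main__":
    for worker in (per_iteration, whole_loop):
        results, elapsed = run(worker, 50_000, 4)
        print(worker.__name__, results[0], f"{elapsed:.4f}s")
```

Both variants compute the same totals. The per-iteration version simply performs n lock handoffs per thread where the whole-loop version performs one, which is the overhead the answer suggests removing.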
