cython.parallel cannot see the difference in speed


Question

I tried to use cython.parallel prange, but I only see two cores at about 50% utilization. How can I make use of all the cores, i.e. send the loop iterations to the cores simultaneously while sharing the arrays volume and mc_vol?

I also wrote a purely sequential for-loop version, which is about 30 seconds faster than the cython.parallel prange version. Both of them use only one core. Is there a way to parallelize this?

cimport cython
from cython.parallel import prange, parallel, threadid
from libc.stdio cimport sprintf
from libc.stdlib cimport malloc, free
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef MC_Surface(np.ndarray[np.int_t,ndim=3] volume, np.ndarray[np.float32_t,ndim=3] mc_vol):
    cdef int vol_len=len(volume)-1
    cdef int k, j, i
    cdef char* pattern # a string pointer - allocate later
    Perm_area = {
        "00000000": 0.000000,
        ...
        "00011101": 1.515500
    }

    try:
        pattern = <char*>malloc(sizeof(char)*260)
        for k in range(vol_len):
            for j in range(vol_len):
                for i in range(vol_len):
                    sprintf(pattern, "%i%i%i%i%i%i%i%i",
                            volume[i, j, k],
                            volume[i, j + 1, k],
                            volume[i + 1, j, k],
                            volume[i + 1, j + 1, k],
                            volume[i, j, k + 1],
                            volume[i, j + 1, k + 1],
                            volume[i + 1, j, k + 1],
                            volume[i + 1, j + 1, k + 1])

                    mc_vol[i, j, k] = Perm_area[pattern]
                    # if Perm_area[pattern] > 0:
                    #    print pattern, 'Area: ', Perm_area[pattern]
                    # total_area += Perm_area[pattern]
    finally:
        free(pattern)
    return mc_vol

EDIT: following DavidW's suggestion, but the prange version is considerably slower:

cpdef MC_Surface(np.ndarray[np.int_t,ndim=3] volume, np.ndarray[np.float32_t,ndim=3] mc_vol):
    cdef int vol_len=len(volume)-1
    cdef int k, j, i
    cdef char* pattern # a string pointer - allocate later
    Perm_area = {
        "00000000": 0.000000,
        ...
        "00011101": 1.515500
    }

    with nogil,parallel():
        try:
            pattern = <char*>malloc(sizeof(char)*260)
            for k in prange(vol_len):
                for j in range(vol_len):
                    for i in range(vol_len):
                        sprintf(pattern, "%i%i%i%i%i%i%i%i",
                                volume[i, j, k],
                                volume[i, j + 1, k],
                                volume[i + 1, j, k],
                                volume[i + 1, j + 1, k],
                                volume[i, j, k + 1],
                                volume[i, j + 1, k + 1],
                                volume[i + 1, j, k + 1],
                                volume[i + 1, j + 1, k + 1])
                        with gil:
                            mc_vol[i, j, k] = Perm_area[pattern]
                            # if Perm_area[pattern] > 0:
                            #    print pattern, 'Area: ', Perm_area[pattern]
                            #    total_area += Perm_area[pattern]
        finally:
            free(pattern)

    return mc_vol

My setup file looks like this:

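# imports are not shown in the original question; these are assumed
import numpy
from setuptools import setup, Extension
from Cython.Distutils import build_ext

setup(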
    name='SurfaceArea',
    ext_modules=[
        Extension('c_marchSurf', ['c_marchSurf.pyx'], include_dirs=[numpy.get_include()],
                  extra_compile_args=['-fopenmp'], extra_link_args=['-fopenmp'], language="c++")
    ],
    cmdclass={'build_ext': build_ext}, requires=['Cython', 'numpy', 'matplotlib', 'pathos', 'scipy', 'cython.parallel']
)

Recommended answer

The problem is the with gil: block, which defines a section that can only run on one core at a time. You aren't doing anything else inside the loop, so you shouldn't really expect any speed-up.

To avoid the GIL you need to avoid Python features wherever possible. You already avoid it in the string formatting part by using the C function sprintf to build your string. For the dictionary lookup, the easiest option is probably the C++ standard library, which contains a map class with similar behaviour. (Note that you'll now need to compile the module in Cython's C++ mode.)

# at the top of your file
from libc.stdio cimport sprintf
from libc.stdlib cimport malloc, free
from libcpp.map cimport map
from libcpp.string cimport string
import numpy as np
cimport numpy as np

# ... code omitted  ....
cpdef MC_Surface(np.ndarray[np.int_t,ndim=3] volume, np.ndarray[np.float32_t,ndim=3] mc_vol):
    # note above I've defined volume as a numpy array so that
    # I can do fast, GIL-less direct array lookup
    cdef char* pattern # a string pointer - allocate later

    Perm_area = {} # some dictionary, as before

    # depending on the size of Perm_area, this conversion to
    # a C++ object is potentially quite slow (it involves a lot
    # of string copies)
    cdef map[string,float] Perm_area_m = Perm_area

    # ... code omitted ...
    with nogil,parallel():
       try:
         # assigning pattern here makes it thread local
         # it's assigned once per thread which isn't too bad
         pattern = <char*>malloc(sizeof(char)*50)
         # when you allocate pattern you need to make it big enough
         # either by calculating a size, or by just making it overly big

         # ... more code omitted...
           # then later, inside your loops
           sprintf(pattern, "%i%i%i%i%i%i%i%i", volume[i, j, k],
                        volume[i, j + 1, k],
                        volume[i + 1, j, k],
                        volume[i + 1, j + 1, k],
                        volume[i, j, k + 1],
                        volume[i, j + 1, k + 1],
                        volume[i + 1, j, k + 1],
                        volume[i + 1, j + 1, k + 1]);
           # and now do the dictionary lookup without the GIL
           # because we're using the C++ class instead.
           # Unfortunately, we also need to do a string copy (which might slow things down)
           mc_vol[i, j, k] = Perm_area_m[string(pattern)]
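           # be aware that, unlike a Python dict, operator[] on a C++ map
           # does not raise for a missing key: it inserts a default (0.0)
           # entry, which is also not thread-safe. Make sure every
           # pattern is present in the map.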
       finally:
         free(pattern)

I've also had to change volume to be a numpy array, since if it were just a Python object I'd need the GIL to index its elements.
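To sanity-check the compiled function on a small input, the per-cell computation can be sketched in plain Python. This is only a slow reference under stated assumptions, not the Cython code itself: the names `cell_pattern` and `mc_surface_reference` are made up for illustration, nested lists stand in for the numpy arrays, and the table holds just the two entries visible in the question (the rest of Perm_area is elided there).

```python
# Two entries copied from the question; the full Perm_area table is elided there.
PERM_AREA = {
    "00000000": 0.000000,
    "00011101": 1.515500,
}

def cell_pattern(volume, i, j, k):
    """Build the 8-character key in the same corner order as the sprintf call."""
    corners = (
        volume[i][j][k],
        volume[i][j + 1][k],
        volume[i + 1][j][k],
        volume[i + 1][j + 1][k],
        volume[i][j][k + 1],
        volume[i][j + 1][k + 1],
        volume[i + 1][j][k + 1],
        volume[i + 1][j + 1][k + 1],
    )
    return "".join("%i" % c for c in corners)

def mc_surface_reference(volume, perm_area):
    """Pure-Python version of the triple loop (hypothetical helper, cubic input)."""
    n = len(volume) - 1
    mc_vol = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for k in range(n):
        for j in range(n):
            for i in range(n):
                mc_vol[i][j][k] = perm_area[cell_pattern(volume, i, j, k)]
    return mc_vol

# A 2x2x2 volume whose eight corners spell out the key "00011101".
volume = [[[0, 1], [0, 1]], [[0, 0], [1, 1]]]
result = mc_surface_reference(volume, PERM_AREA)
```

Comparing `result` against the output of the compiled MC_Surface on the same small volume is one way to confirm that the parallel version still fills mc_vol correctly.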

(Edit: changed to take the dictionary lookup out of the GIL block too, by using a C++ map.)
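The string copy mentioned in the code comments could probably be avoided altogether by treating the eight corner values as bits of an integer and indexing a 256-entry table. This is a different technique from the one in the answer, shown here in plain Python as a sketch; `build_area_table` and `pattern_index` are hypothetical names, and the sketch assumes every corner value is 0 or 1. In Cython the same idea would presumably use a typed float array, which can be indexed without the GIL.

```python
def build_area_table(perm_area):
    """Convert {"00011101": area, ...} into a 256-entry list indexed by the bits.

    Keys missing from perm_area keep the default 0.0 (an assumption; a dict
    lookup would raise KeyError instead).
    """
    table = [0.0] * 256
    for key, area in perm_area.items():
        table[int(key, 2)] = area  # e.g. "00011101" -> index 29
    return table

def pattern_index(corners):
    """Pack eight 0/1 corner values into one integer, first corner = highest bit."""
    index = 0
    for c in corners:
        index = (index << 1) | c  # assumes c is 0 or 1
    return index

# Only the two entries visible in the question; the real table is elided there.
table = build_area_table({"00000000": 0.000000, "00011101": 1.515500})
area = table[pattern_index((0, 0, 0, 1, 1, 1, 0, 1))]
```

The corner order passed to `pattern_index` must match the order used in the sprintf call, otherwise the indices will not line up with the original string keys.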
