Parallelise python loop with numpy arrays and shared-memory


Problem description


I am aware of several questions and answers on this topic, but haven't found a satisfactory answer to this particular problem:

What is the easiest way to do a simple shared-memory parallelisation of a python loop where numpy arrays are manipulated through numpy/scipy functions?
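For concreteness, a serial loop of the kind in question might look like the following sketch (the array shape, iteration count, and use of np.cos are illustrative placeholders, not from the original question):

import numpy as np

x = np.zeros((200, 2000))
for i in range(x.shape[0]):        # independent tasks: one row at a time
    for _ in range(100):
        x[i, :] = np.cos(x[i, :])  # numpy ufunc operating on a slice of shared memory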

I am not looking for the most efficient way; I just want something simple to implement that doesn't require a significant rewrite when the loop is not run in parallel --- just like OpenMP provides in lower-level languages.

The best answer I've seen in this regard is this one, but it is a rather clunky approach: it requires one to express the loop as a function that takes a single argument, plus several lines of shared-array-converting crud, it seems to require that the parallel function be called from __main__, and it doesn't seem to work well from the interactive prompt (where I spend a lot of my time).

With all of Python's simplicity, is this really the best way to parallelise a loop? Really? This is something trivial to parallelise in OpenMP fashion.

I have painstakingly read through the opaque documentation of the multiprocessing module, only to find out that it is so general that it seems suited to everything but simple loop parallelisation. I am not interested in setting up Managers, Proxies, Pipes, etc. I just have a simple, fully parallel loop with no communication between tasks. Using MPI to parallelise such a simple situation seems like overkill, not to mention that it would be memory-inefficient in this case.

I haven't had time to learn about the multitude of different shared-memory parallel packages for Python, but was wondering if someone has more experience in this and can show me a simpler way. Please do not suggest serial optimisation techniques such as Cython (I already use it), or using parallel numpy/scipy functions such as BLAS (my case is more general, and more parallel).

Solution

With Cython parallel support:

# asd.pyx
from cython.parallel cimport prange

import numpy as np

def foo():
    cdef int i, j, n

    x = np.zeros((200, 2000), float)

    n = x.shape[0]
    # prange spreads the rows across OpenMP threads; the GIL is released
    # for the loop as a whole and re-acquired per iteration, since the
    # body calls back into numpy.
    for i in prange(n, nogil=True):
        with gil:
            for j in range(100):
                x[i,:] = np.cos(x[i,:])

    return x

On a 2-core machine:

$ cython asd.pyx
$ gcc -fPIC -fopenmp -shared -o asd.so asd.c -I/usr/include/python2.7
$ export OMP_NUM_THREADS=1
$ time python -c 'import asd; asd.foo()'
real    0m1.548s
user    0m1.442s
sys 0m0.061s

$ export OMP_NUM_THREADS=2
$ time python -c 'import asd; asd.foo()'
real    0m0.602s
user    0m0.826s
sys 0m0.075s

This runs fine in parallel, since np.cos (like other ufuncs) releases the GIL.

If you want to use this interactively:

# asd.pyxbld
def make_ext(modname, pyxfilename):
    from distutils.extension import Extension
    # Tell pyximport to compile asd.pyx with OpenMP enabled.
    return Extension(name=modname,
                     sources=[pyxfilename],
                     extra_link_args=['-fopenmp'],
                     extra_compile_args=['-fopenmp'])

and (remove asd.so and asd.c first):

>>> import pyximport
>>> pyximport.install(reload_support=True)
>>> import asd
>>> q1 = asd.foo()
# Go to an editor and change asd.pyx
>>> reload(asd)
>>> q2 = asd.foo()

So yes, in some cases you can parallelise just by using threads. OpenMP is just a fancy wrapper around threading, so Cython is only needed here for the nicer syntax. Without Cython, you can use the threading module --- it works similarly to multiprocessing (and probably more robustly), but you don't need to do anything special to declare arrays as shared memory.
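A minimal sketch of that no-Cython route, assuming (as in the example above) that the loop body is dominated by GIL-releasing ufunc calls; the array shape and the two-thread row split are illustrative:

import threading
import numpy as np

x = np.zeros((200, 2000))

def worker(rows):
    # x is ordinary process memory, so the threads share it with
    # no special declarations; each thread owns a disjoint set of rows.
    for i in rows:
        for _ in range(100):
            x[i, :] = np.cos(x[i, :])   # the ufunc releases the GIL while it runs

nthreads = 2
chunks = [range(k, x.shape[0], nthreads) for k in range(nthreads)]
threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()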

However, not all operations release the GIL, so performance will vary (YMMV).
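For instance, if the loop body does its arithmetic element by element in pure Python instead of through a ufunc, the GIL is held throughout and the same threading approach gains nothing (an illustrative counterexample):

import numpy as np

x = np.zeros((200, 2000))

def gil_bound_worker(rows):
    # Element-wise Python-level arithmetic never releases the GIL,
    # so threads running this body execute serially in practice.
    for i in rows:
        for j in range(x.shape[1]):
            x[i, j] = x[i, j] ** 2 + 1.0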

***

And another possibly useful link scraped from other Stackoverflow answers --- another interface to multiprocessing: http://packages.python.org/joblib/parallel.html
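A minimal sketch of what that could look like, using joblib's documented Parallel/delayed interface (note that joblib's default backend runs worker processes, so each task receives a copy of its input rather than true shared memory; n_jobs and the task function are illustrative):

import numpy as np
from joblib import Parallel, delayed

def task(row):
    # Runs in a worker process on a copy of the row.
    for _ in range(100):
        row = np.cos(row)
    return row

x = np.zeros((200, 2000))
rows = Parallel(n_jobs=2)(delayed(task)(x[i, :]) for i in range(x.shape[0]))
x = np.vstack(rows)   # reassemble the results in the parent process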
