Efficient parallelization of operations on two-dimensional arrays in Python


Problem Description

I'm trying to parallelize operations on a two-dimensional array using the joblib library in Python. Here is the code I have:

from joblib import Parallel, delayed
import multiprocessing
import numpy as np

# The code below just aggregates the base_array to form a new two dimensional array
base_array = np.ones((2**12, 2**12), dtype=np.uint8)
def compute_average(i, j):
    return np.uint8(np.mean(base_array[i*4: (i+1)*4, j*4: (j+1)*4]))

num_cores = multiprocessing.cpu_count()
new_array = np.array(Parallel(n_jobs=num_cores)(delayed(compute_average)(i, j) 
                                        for i in xrange(0,1024) for j in xrange(0,1024)), dtype=np.uint8)

The above code takes more time than the basic nested for loop below.

new_array_nested = np.ones((2**10, 2**10), dtype=np.uint8)
for i in xrange(0,1024):
    for j in xrange(0,1024):
         new_array_nested[i,j] = compute_average(i,j)

Why are parallel operations taking more time? How can the efficiency of the above code be improved?

Solution

Wow! Absolutely loved your code. It worked like a charm, improving the total efficiency by 400x. I'll try to read more about numba and JIT compilers, but could you briefly explain why it is so efficient? Thanks once again for all the help! – Ram Jan 3 '18 at 20:30

We can quite easily get somewhere under 77 [ms], but it takes mastering a few steps to get there, so let's start:


Q: Why are the parallel operations taking more time?

Because the proposed joblib step creates that many full-scale process copies, so as to escape from the GIL-stepped pure-[SERIAL] dancing ( one after another ). But (!) this includes the add-on costs of all the memory transfers ( very expensive / sensitive for indeed large numpy arrays ) of all variables and of the whole Python interpreter and its internal state, before a process even starts doing a first step of the "useful" work on your "payload" calculation strategy,
so
the sum of all these instantiation overheads can easily become larger than the overhead-agnostic expectation of an inversely proportional 1 / N factor,
where you set N ~ num_cores.

For details, read the mathematical formulation in the tail part of Amdahl's Law re-formulation here.
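To make the impact tangible, here is a hedged, illustrative sketch of such an overhead-strict speedup estimate ( the function name and the concrete overhead fractions below are assumptions for illustration, not the linked re-formulation itself ):

# p   : parallelisable fraction of the original runtime
# N   : number of workers
# pSO : process setup overheads ( spawns + memory transfers ), as a fraction of the runtime
# pTO : process termination overheads ( result collection ),   as a fraction of the runtime
def overhead_strict_speedup( p, N, pSO, pTO ):
    return 1.0 / ( ( 1 - p ) + pSO + p / float( N ) + pTO )

print overhead_strict_speedup( 0.99, 8, 0.00, 0.00 )  # ~ 7.5 x, the overhead-agnostic hope
print overhead_strict_speedup( 0.99, 8, 0.40, 0.05 )  # ~ 1.7 x, once spawning costs ~ 40% of the runtime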


Q: What can help improve the efficiency of the above code?

Save as much as possible on all the overhead costs, wherever possible:
- on the process spawn-side, try to use n_jobs = ( num_cores - 1 ) to leave more headroom for the "main" process, and benchmark whether the performance goes up
- on the process termination-side, avoid collecting and constructing a new ( possibly large ) object from the returned values; rather, pre-allocate a just-large-enough process-local data structure and return something compact and efficiently serialised, for an easy and non-blocking coalescing of the per-partes returned partial results ( a sketch of this pre-allocation idea follows below )

Both of these "hidden" costs are your main design enemies, as they get linearly added to the pure-[SERIAL] part of the computing-path of the whole problem solution ( ref.: the effects of both of these in the overhead-strict Amdahl's Law formula )
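As a minimal sketch of that pre-allocation advice ( assuming joblib workers may share a writable numpy memmap, which gets passed between processes via its backing file; the helper compute_row and the file name new_array.mmap are made up for illustration ), each call can process a whole row of tiles and write straight into a pre-allocated output, instead of returning 2**20 individual values:

import multiprocessing
import numpy as np
from joblib import Parallel, delayed

base_array = np.ones( (2**12, 2**12), dtype = np.uint8 )

# pre-allocated, file-backed output, shareable across the worker processes
out = np.memmap( 'new_array.mmap', dtype = np.uint8,
                 shape = (2**10, 2**10), mode = 'w+' )

def compute_row( i, base, out ):   # one call per row of 4x4 tiles,
    for j in xrange( 2**10 ):      # not one call per single tile
        out[i, j] = np.uint8( base[4*i:4*(i+1), 4*j:4*(j+1)].mean() )

Parallel( n_jobs = multiprocessing.cpu_count() - 1 )(
          delayed( compute_row )( i, base_array, out )
          for i in xrange( 2**10 ) )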


Experiments & Results:

>>> from zmq import Stopwatch; aClk = Stopwatch()
>>> base_array = np.ones( (2**12, 2**12), dtype = np.uint8 )
>>> base_array.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA      : True
  WRITEABLE    : True
  ALIGNED      : True
  UPDATEIFCOPY : False
>>> def compute_average_per_TILE(               TILE_i,   TILE_j ):  # NAIVE MODE
...     return np.uint8( np.mean( base_array[ 4*TILE_i:4*(TILE_i+1),
...                                           4*TILE_j:4*(TILE_j+1)
...                                           ]
...                               )
...                      )
... 
>>> aClk.start(); _ = compute_average_per_TILE( 12,13 ); aClk.stop()
25110
  102
  109
   93

This takes about 93 [us] per shot, so expect about 1024 * 1024 * 93 ~ 97,517,568 [us] ( ~ 97 seconds ) to cover the mean-processing over the whole base_array.

Experimentally, one can nicely see the impact of overheads that are not handled well; the naive nested-loop experiment took:

>>> aClk.start(); _ = [ compute_average_per_TILE( i, j )
                                              for i    in xrange(1024)
                                              for    j in xrange(1024)
                        ]; aClk.stop()
26310594
^^...... 
26310594 / 1024. / 1024. == 25.09 [us/cell]

which is about 3.7x less, due to not incurring the "tail"-part overheads ( assignment of the individual returned values ) 2**20 times, but just once, at the terminal assignment.

Yet, more surprises to come.


What is a proper tool here?

There is never a universal rule, no one-size-fits-all.

Given
that no more than a single 4x4 matrix-tile gets processed per call ( each actually taking less than 25 [us] ), while the original proposal orchestrated, via joblib, a spawn of 2**20 such calls, distributed over ~ .cpu_count() fully instantiated processes

...( joblib.Parallel( n_jobs = num_cores )( 
     joblib.delayed( compute_average )( i, j ) 
                                    for i    in xrange( 1024 )
                                    for    j in xrange( 1024 )
     ) )

there is indeed space to improve the performance.

For these small-scale matrices ( not all problems are so happy in this sense ), one can expect the best results from smarter memory-access patterns and from reducing the Python GIL-originated weaknesses.

As the per-call span is just a 4x4 micro-sized computation, a far better way is to harness smart vectorisation ( all data fit into cache, so in-cache computing is a holiday journey when hunting for utmost performance ).
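For instance ( a minimal, plain-numpy sketch of such a vectorised approach; not necessarily the exact code behind the timings quoted below ), the whole 2**20-tile averaging collapses into one reshape plus one in-cache reduction:

# view the (4096, 4096) array as a (1024, 4, 1024, 4) grid of tiles
# and reduce every 4x4 tile with a single vectorised mean:
tiles     = base_array.reshape( 1024, 4, 1024, 4 )
new_array = tiles.mean( axis = (1, 3) ).astype( np.uint8 )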

The best ( still very naively vectorised ) code
was able to get from ~ 25 [us/cell] down to less than ~ 74 [ns/cell] ( still leaving space for better-aligned processing, as it took ~ 4.6 [ns] per base_array cell processed ), so expect yet another level of speedup if the in-cache-optimised vectorised code gets crafted properly.

In 77 [ms] ?! Worth doing that right, isn't it?

Not 97 seconds,
not 25 seconds,
but less than 77 [ms] in just a few keyboard strokes, and more could have been squeezed out by better optimising the call-signature:

>>> import numba
>>> @numba.jit( nogil = True, nopython = True )
... def jit_avg2( base_IN, ret_OUT ):  # all memory pre-allocated for these data-structures
...     for i in np.arange( 1024 ):    # vectorised-code ready numpy iterator
...         for j in np.arange( 1024 ):# vectorised-code ready numpy iterator
...             ret_OUT[i,j] = np.uint8( np.mean( base_IN[4*i:4*(i+1),
...                                                       4*j:4*(j+1)
...                                                       ]
...                                               )
...                                      )
...     return                         # avoid terminal assignment costs
... 

>>> mean_array = np.zeros( (2**10, 2**10), dtype = np.uint8 )  # pre-allocated result array
>>> aClk.start(); _ = jit_avg2( base_array, mean_array ); aClk.stop()
1586182 ( the first call, even with all the jit-compilation circus included, was FASTER than the GIL-stepped nested fors ... )
  76935
  77337

Once the compilation cost is amortised, the repeated calls land at the promised ~ 77 [ms] for the whole 1024 x 1024 result.
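As a quick sanity check ( a hypothetical snippet, reusing the reshape trick sketched above ), the JIT-ed result can be compared against a plain-numpy reference:

>>> assert np.array_equal( mean_array,
...                        base_array.reshape( 1024, 4, 1024, 4 )
...                                  .mean( axis = (1, 3) )
...                                  .astype( np.uint8 ) )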
