How to parallelize this Python for loop when using Numba
Question
I'm using the Anaconda distribution of Python, together with Numba, and I've written the following Python function that multiplies a sparse matrix A (stored in CSR format) by a dense vector x:
import numpy
from numba import jit

@jit
def csrMult( x, Adata, Aindices, Aindptr, Ashape ):
    numRowsA = Ashape[0]
    Ax = numpy.zeros( numRowsA )
    for i in range( numRowsA ):
        Ax_i = 0.0
        for dataIdx in range( Aindptr[i], Aindptr[i+1] ):
            j = Aindices[dataIdx]
            Ax_i += Adata[dataIdx] * x[j]
        Ax[i] = Ax_i
    return Ax
A is a large scipy sparse matrix,
>>> A.shape
( 56469, 39279 )
# having ~ 142,258,302 nonzero entries (so about 6.4% )
>>> type( A[0,0] )
dtype( 'float32' )
and x is a numpy array. Here is a snippet of code that calls the above function:
x = numpy.random.randn( A.shape[1] )
Ax = A.dot( x )
AxCheck = csrMult( x, A.data, A.indices, A.indptr, A.shape )
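For readers less familiar with the CSR layout that csrMult() iterates over, here is a tiny pure-Python illustration; the 2x3 matrix, its values, and the helper name csr_mult_py are made up for this sketch and are not from the question:

```python
# A 2x3 sparse matrix:
#   [[10,  0, 20],
#    [ 0, 30,  0]]
# CSR stores only the nonzeros, row by row:
Adata    = [10.0, 20.0, 30.0]   # nonzero values
Aindices = [0, 2, 1]            # column index of each stored value
Aindptr  = [0, 2, 3]            # row i owns slots Aindptr[i] .. Aindptr[i+1]-1
Ashape   = (2, 3)

def csr_mult_py(x, Adata, Aindices, Aindptr, Ashape):
    # Same loop structure as csrMult() above, without numpy/Numba.
    Ax = [0.0] * Ashape[0]
    for i in range(Ashape[0]):
        for dataIdx in range(Aindptr[i], Aindptr[i + 1]):
            Ax[i] += Adata[dataIdx] * x[Aindices[dataIdx]]
    return Ax

print(csr_mult_py([1.0, 1.0, 1.0], Adata, Aindices, Aindptr, Ashape))  # [30.0, 30.0]
```

Row 0 picks up 10*x[0] + 20*x[2], row 1 only 30*x[1], which is exactly what the inner loop of csrMult() computes per row.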
Notice the @jit decorator that tells Numba to do a just-in-time compilation for the csrMult() function.
In my experiments, my function csrMult() is about twice as fast as the scipy .dot() method. That is a pretty impressive result for Numba.
However, MATLAB still performs this matrix-vector multiplication about 6 times faster than csrMult(). I believe that is because MATLAB uses multithreading when performing sparse matrix-vector multiplication.
How can I parallelize the outer for-loop when using Numba?
Numba used to have a prange() function that made it simple to parallelize embarrassingly parallel for-loops. Unfortunately, Numba no longer has prange() [actually, that is false, see the edit below]. So what is the correct way to parallelize this for-loop now that Numba's prange() function is gone?
When prange() was removed from Numba, what alternative did the developers of Numba have in mind?
Edit 1:
I updated to the latest version of Numba, which is 0.35, and prange() is back! It was not included in version 0.33, the version I had been using.
That is good news, but unfortunately I am getting an error message when I attempt to parallelize my for loop using prange(). Here is a parallel for loop example from the Numba documentation (see section 1.9.2 "Explicit Parallel Loops"), and below is my new code:
import numpy as np
from numba import njit, prange

@njit( parallel=True )
def csrMult_numba( x, Adata, Aindices, Aindptr, Ashape ):
    numRowsA = Ashape[0]
    Ax = np.zeros( numRowsA )
    for i in prange( numRowsA ):
        Ax_i = 0.0
        for dataIdx in range( Aindptr[i], Aindptr[i+1] ):
            j = Aindices[dataIdx]
            Ax_i += Adata[dataIdx] * x[j]
        Ax[i] = Ax_i
    return Ax
When I call this function, using the code snippet given above, I receive the following error:
AttributeError: Failed at nopython (convert to parfors) 'SetItem' object has no attribute 'get_targets'
Given the above attempt to use prange crashes, my question stands:
What is the correct way (using prange or an alternative method) to parallelize this Python for-loop?
As noted below, it was trivial to parallelize a similar for loop in C++ and obtain an 8x speedup when run on 20 OMP threads. There must be a way to do it using Numba, since the for loop is embarrassingly parallel (and since sparse matrix-vector multiplication is a fundamental operation in scientific computing).
Edit 2:
Here is my C++ version of csrMult(). Parallelizing the for() loop in the C++ version makes the code about 8x faster in my tests. This suggests to me that a similar speedup should be possible for the Python version when using Numba.
void csrMult(VectorXd& Ax, VectorXd& x, vector<double>& Adata, vector<int>& Aindices, vector<int>& Aindptr)
{
    // This code assumes that the size of Ax is numRowsA.
    #pragma omp parallel num_threads(20)
    {
        #pragma omp for schedule(dynamic,590)
        for (int i = 0; i < Ax.size(); i++)
        {
            double Ax_i = 0.0;
            for (int dataIdx = Aindptr[i]; dataIdx < Aindptr[i + 1]; dataIdx++)
            {
                Ax_i += Adata[dataIdx] * x[Aindices[dataIdx]];
            }
            Ax[i] = Ax_i;
        }
    }
}
Answer
Thanks for the quantitative updates, Daniel.
The following lines may be hard to digest, but believe me, there are more things to take into account. I have worked on HPC / parallel-computing problems with matrix scales of ~ N [TB]; N > 10 and their sparse accompaniments, so some pieces of experience may be useful for your further views.
A wish to parallelise a piece of code sounds like a more and more often contemporary re-articulated mana. The problem is not the code, but the cost of such move.
The economy is the number one problem. Amdahl's Law, as it was originally formulated by Gene Amdahl, did not take into account the very costs of [PAR]-processes-setups + [PAR]-processes-finalisations & terminations, which indeed have to be paid in every real-world implementation.
The overhead-strict Amdahl's Law depicts the scale of these un-avoidable adverse effects and helps understand a few new aspects that have to be evaluated before one opts to introduce parallelisation ( at an acceptable cost of doing so, as it is very, indeed VERY EASY to pay MUCH more than one may gain from -- where a naive disappointment from a degraded processing performance is the easier part of the story ).
Feel free to read more posts on the overhead-strict Amdahl's Law re-formulation, if willing to better understand this topic and to pre-calculate the actual "minimum"-subProblem-"size", for which the sum-of-[PAR]-overheads will get at least justified by real-world tools for introducing the parallel-split of the subProblem onto N_trully_[PAR]_processes (not any "just"-[CONCURRENT], but true-[PARALLEL] -- these are by far not equal).
Python is a great prototyping eco-system, whereas numba, numpy and other compiled-extensions help a lot to boost performance way farther than a native, GIL-stepped python (co-)-processing typically delivers.
Here, you try to enforce numba.jit() to arrange the job almost-for-free, just by its automated jit()-time lexical-analyser (that you throw your code on), which ought both to "understand" your global goal (What to do), and also to propose some vectorisation-tricks (How best to assemble a heap of CPU-instructions for maximum efficiency of such code-execution).
This sounds easy, but it is not.
Travis Oliphant's team has made immense progress on numba tools, but let's be realistic and fair not to expect any form of automated wizardry to get implemented inside a .jit()-lexer + code-analysis, when trying to transform a code and assemble a more efficient flow of machine instructions to implement the high-level task's goal.
Due to [PSPACE] sizing, you may immediately forget to ask numba to somehow efficiently "stuff" the GPU-engine with data, a memory-footprint of which is way beyond the GPU-GDDR sizings (not speaking at all about too-"shallow" GPU-kernel sizes for such mathematically-"tiny" processing, which just multiplies, potentially in [PAR], but later sums in [SEQ]).
(Re-)-Loading the GPU with data takes heaps of time. If having paid that, the In-GPU memory-latencies are not very friendly for the "tiny"-GPU-kernels economy either -- your GPU-SMX code-execution will have to pay ~ 350-700 [ns] just to fetch a number (most probably not automatically re-aligned for the best coalesced SM-cache-friendly re-use in next steps, and you may notice that you never, let me repeat it, NEVER re-use a single matrix cell at all, so caching per-se will not deliver anything under those 350~700 [ns] per matrix cell), while a smart pure numpy-vectorised code can process the matrix-vector product in less than 1 [ns] per cell on even the largest [PSPACE]-footprints.
That is a yardstick to compare to.
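Those yardstick figures can be turned into a back-of-envelope estimate for the matrix from the question (~142 M nonzeros). The per-cell latencies below are the ones quoted above; the arithmetic itself is only illustrative:

```python
nnz = 142_258_302                # nonzero entries of A, from the question

# ~ 1 [ns] per cell: smart vectorised CPU code (the yardstick above)
cpu_vectorised_s = nnz * 1e-9

# ~ 350-700 [ns] per un-cached fetch: the In-GPU latency quoted above
gpu_fetch_low_s  = nnz * 350e-9
gpu_fetch_high_s = nnz * 700e-9

print( "vectorised CPU        ~ %.2f [s]" % cpu_vectorised_s )
print( "latency-bound fetches ~ %.0f .. %.0f [s]" % ( gpu_fetch_low_s, gpu_fetch_high_s ) )
```

That is roughly 0.14 [s] at 1 [ns]/cell against tens of seconds if every cell costs a full uncoalesced memory fetch, which is the whole point of the yardstick.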
(Profiling would better show the hard-facts here, but the principle is well known beforehand, without testing how to move a few TB of data onto the GPU-fabric just to realise this on one's own.)
Given the memory scales of the matrix A, the worst effect to be expected is that the sparse organisation of the storage of the matrix representation will most probably devastate most, if not all, possible performance gains achievable by numba-vectorised tricks on dense matrix representations, as there will likely be almost zero chance for efficient cache-line re-use of fetched memory, and sparsity will also break any easy way to achieve a compact mapping of vectorised operations, so these will hardly remain able to get easily translated into advanced CPU-hardware vector-processing resources.
- always better pre-allocate the vector Ax = np.zeros_like( A[:,0] ) and pass it as another parameter into the numba.jit()-compiled parts of the code, so as to avoid repetitive paying of additional [PTIME, PSPACE]-costs for creating (again) new memory-allocations (the more so if the vector is suspected of being used inside an externally orchestrated iterative optimisation process)
- always better specify (to narrow the universality, for the sake of resulting code performance) at least the numba.jit( "f8[:]( f4[:], f4[:,:], ... )" )-calling interface directives
- always review all the numba.jit()-options available and their respective default values (they may change version to version) for your specific situation (disabling GIL and better aligning the goals with numba + hardware capabilities will always help in numerically intensive parts of the code)
@jit( signature = [ numba.float32( numba.float32, numba.int32 ),  #________________ # [_v41] @decorator with a list of calling-signatures
                    numba.float64( numba.float64, numba.int64 )   #
                    ],    #__________________ a list of signatures for prepared alternative code-paths, to avoid a deferred lazy-compilation if undefined
      nopython = False,   #__________________ forces the function to be compiled in nopython mode. If not possible, compilation will raise an error.
      nogil    = False,   #__________________ tries to release the global interpreter lock inside the compiled function. The GIL will only be released if Numba can compile the function in nopython mode, otherwise a compilation warning will be printed.
      cache    = False,   #__________________ enables a file-based cache to shorten compilation times when the function was already compiled in a previous invocation. The cache is maintained in the __pycache__ subdirectory of the directory containing the source file.
      forceobj = False,   #__________________ forces the function to be compiled in object mode. Since object mode is slower than nopython mode, this is mostly useful for testing purposes.
      locals   = {}       #__________________ a mapping of local variable names to Numba Types.
      )                   #____________________ # [_v41] ZERO <____ TEST *ALL* CALLED sub-func()-s to @.jit() too >>>>>>>>>>>>>>>>>>>>> [DONE]
def r...(...):
    ...
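The first bullet above, pre-allocating Ax and overwriting it in place, can be sketched as follows. This is a numba-free, pure-Python sketch: the name csrMult_prealloc and the toy CSR data are hypothetical illustrations; with numpy one would pre-allocate via Ax = np.zeros_like( A[:,0] ) as stated above:

```python
def csrMult_prealloc( Ax, x, Adata, Aindices, Aindptr ):
    # Ax is allocated ONCE by the caller and overwritten in place,
    # so no fresh output buffer gets created on every call.
    for i in range( len( Ax ) ):
        Ax_i = 0.0
        for dataIdx in range( Aindptr[i], Aindptr[i+1] ):
            Ax_i += Adata[dataIdx] * x[Aindices[dataIdx]]
        Ax[i] = Ax_i

# caller side: pay for the output buffer once, outside any hot loop
Adata, Aindices, Aindptr = [10.0, 20.0, 30.0], [0, 2, 1], [0, 2, 3]
Ax = [0.0] * 2                       # pre-allocated once, for a 2-row toy matrix
for _ in range( 3 ):                 # e.g. an externally orchestrated iterative process
    csrMult_prealloc( Ax, [1.0, 1.0, 1.0], Adata, Aindices, Aindptr )
print( Ax )                          # [30.0, 30.0]
```

Every pass of the outer loop re-uses the same buffer, which is exactly the repetitive [PTIME, PSPACE] allocation cost the bullet recommends avoiding.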