How to parallelize this Python for loop when using Numba


Problem Description
I'm using the Anaconda distribution of Python, together with Numba, and I've written the following Python function that multiplies a sparse matrix A (stored in a CSR format) by a dense vector x:

import numpy
from numba import jit

@jit
def csrMult( x, Adata, Aindices, Aindptr, Ashape ):

    numRowsA = Ashape[0]
    Ax       = numpy.zeros( numRowsA )

    for i in range( numRowsA ):
        Ax_i = 0.0
        for dataIdx in range( Aindptr[i], Aindptr[i+1] ):

            j     = Aindices[dataIdx]
            Ax_i +=    Adata[dataIdx] * x[j]

        Ax[i] = Ax_i

    return Ax 

Here A is a large scipy sparse matrix,

>>> A.shape
( 56469, 39279 )
#                  having ~ 142,258,302 nonzero entries (so about 6.4% )
>>> type( A[0,0] )
dtype( 'float32' )

and x is a numpy array. Here is a snippet of code that calls the above function:

x       = numpy.random.randn( A.shape[1] )
Ax      = A.dot( x )   
AxCheck = csrMult( x, A.data, A.indices, A.indptr, A.shape )
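As a sanity check on the CSR layout that csrMult() expects, the three arrays can be built by hand for a tiny matrix. This is a toy example independent of the code above, using plain NumPy with no compilation involved:

```python
import numpy

# Toy 2x3 matrix in CSR form:  [[1, 0, 2],
#                               [0, 3, 0]]
Adata    = numpy.array([1.0, 2.0, 3.0])  # nonzero values, row by row
Aindices = numpy.array([0, 2, 1])        # column index of each nonzero value
Aindptr  = numpy.array([0, 2, 3])        # row i occupies Adata[Aindptr[i]:Aindptr[i+1]]
x        = numpy.array([1.0, 1.0, 1.0])

# Same double loop as csrMult(), without the @jit decorator
Ax = numpy.zeros(2)
for i in range(2):
    for dataIdx in range(Aindptr[i], Aindptr[i + 1]):
        Ax[i] += Adata[dataIdx] * x[Aindices[dataIdx]]

print(Ax)  # → [3. 3.]
```

These are exactly the arrays that scipy exposes as A.data, A.indices, and A.indptr for a csr_matrix.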

Notice the @jit-decorator that tells Numba to do a just-in-time compilation for the csrMult() function.

In my experiments, my function csrMult() is about twice as fast as the scipy .dot() method. That is a pretty impressive result for Numba.

However, MATLAB still performs this matrix-vector multiplication about 6 times faster than csrMult(). I believe that is because MATLAB uses multithreading when performing sparse matrix-vector multiplication.


Question:

How can I parallelize the outer for-loop when using Numba?

Numba used to have a prange() function that made it simple to parallelize embarrassingly parallel for-loops. Unfortunately, Numba no longer has prange() [actually, that is false, see the edit below]. So what is the correct way to parallelize this for-loop now that Numba's prange() function is gone?

When prange() was removed from Numba, what alternative did the developers of Numba have in mind?


Edit 1:
I updated to the latest version of Numba, which is 0.35, and prange() is back! It was not included in version 0.33, the version I had been using.
That is good news, but unfortunately I am getting an error message when I attempt to parallelize my for loop using prange(). Here is a parallel for loop example from the Numba documentation (see section 1.9.2 "Explicit Parallel Loops"), and below is my new code:

import numpy as np
from numba import njit, prange
@njit( parallel=True )
def csrMult_numba( x, Adata, Aindices, Aindptr, Ashape):

    numRowsA = Ashape[0]    
    Ax       = np.zeros( numRowsA )

    for i in prange( numRowsA ):
        Ax_i = 0.0        
        for dataIdx in range( Aindptr[i],Aindptr[i+1] ):

            j     = Aindices[dataIdx]
            Ax_i +=    Adata[dataIdx] * x[j]

        Ax[i] = Ax_i            

    return Ax 

When I call this function, using the code snippet given above, I receive the following error:

AttributeError: Failed at nopython (convert to parfors) 'SetItem' object has no attribute 'get_targets'


Given that the above attempt to use prange crashes, my question stands:

What is the correct way ( using prange or an alternative method ) to parallelize this Python for-loop?

As noted below, it was trivial to parallelize a similar for loop in C++ and obtain an 8x speedup, having been run on 20-omp-threads. There must be a way to do it using Numba, since the for loop is embarrassingly parallel (and since sparse matrix-vector multiplication is a fundamental operation in scientific computing).
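The row-wise decomposition that makes this loop embarrassingly parallel can be sketched with nothing but the standard library: each worker owns a disjoint slice of rows, so no two workers ever write the same element of Ax and no synchronization is needed. This is only an illustration of the decomposition (in pure Python the GIL prevents an actual speedup, which is exactly why prange with nopython compilation matters); the helper name csr_rows is hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy

def csr_rows(Ax, lo, hi, x, Adata, Aindices, Aindptr):
    # Each worker writes only rows [lo, hi), so workers never touch
    # the same element of Ax: the outer loop is embarrassingly parallel.
    for i in range(lo, hi):
        s = 0.0
        for dataIdx in range(Aindptr[i], Aindptr[i + 1]):
            s += Adata[dataIdx] * x[Aindices[dataIdx]]
        Ax[i] = s

# Toy 2x3 CSR matrix [[1, 0, 2], [0, 3, 0]], split across 2 workers
Adata    = numpy.array([1.0, 2.0, 3.0])
Aindices = numpy.array([0, 2, 1])
Aindptr  = numpy.array([0, 2, 3])
x        = numpy.array([1.0, 1.0, 1.0])

Ax = numpy.zeros(2)
with ThreadPoolExecutor(max_workers=2) as pool:
    pool.submit(csr_rows, Ax, 0, 1, x, Adata, Aindices, Aindptr)
    pool.submit(csr_rows, Ax, 1, 2, x, Adata, Aindices, Aindptr)

print(Ax)  # → [3. 3.]
```

The OpenMP version below exploits the same property: `#pragma omp for` simply hands each thread its own range of i values.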


Edit 2:
Here is my C++ version of csrMult(). Parallelizing the for() loop in the C++ version makes the code about 8x faster in my tests. This suggests to me that a similar speedup should be possible for the Python version when using Numba.

#include <vector>
#include <Eigen/Dense>   // VectorXd comes from the Eigen library
using Eigen::VectorXd;
using std::vector;

void csrMult(VectorXd& Ax, VectorXd& x, vector<double>& Adata, vector<int>& Aindices, vector<int>& Aindptr)
{
    // This code assumes that the size of Ax is numRowsA.
    #pragma omp parallel num_threads(20)
    {       
        #pragma omp for schedule(dynamic,590) 
        for (int i = 0; i < Ax.size(); i++)
        {
            double Ax_i = 0.0;
            for (int dataIdx = Aindptr[i]; dataIdx < Aindptr[i + 1]; dataIdx++)
            {
                Ax_i += Adata[dataIdx] * x[Aindices[dataIdx]];
            }

            Ax[i] = Ax_i;
        }
    }
}

Solution

Numba has been updated and prange() works now! (I'm answering my own question.)

The improvements to Numba's parallel computing capabilities are discussed in this blog post, dated December 12, 2017. Here is a relevant snippet from the blog:

Long ago (more than 20 releases!), Numba used to have support for an idiom to write parallel for loops called prange(). After a major refactoring of the code base in 2014, this feature had to be removed, but it has been one of the most frequently requested Numba features since that time. After the Intel developers parallelized array expressions, they realized that bringing back prange would be fairly easy.

Using Numba version 0.36.1, I can parallelize my embarrassingly parallel for-loop using the following simple code:

import numba
import numpy as np

@numba.jit(nopython=True, parallel=True)
def csrMult_parallel(x,Adata,Aindices,Aindptr,Ashape): 

    numRowsA = Ashape[0]    
    Ax = np.zeros(numRowsA)

    for i in numba.prange(numRowsA):
        Ax_i = 0.0        
        for dataIdx in range(Aindptr[i],Aindptr[i+1]):

            j = Aindices[dataIdx]
            Ax_i += Adata[dataIdx]*x[j]

        Ax[i] = Ax_i            

    return Ax

In my experiments, parallelizing the for-loop made the function execute about eight times faster than the version I posted at the beginning of my question, which was already using Numba but was not parallelized. Moreover, in my experiments the parallelized version is about 5x faster than the command Ax = A.dot(x), which uses scipy's sparse matrix-vector multiplication function. Numba has crushed scipy, and I finally have a Python sparse matrix-vector multiplication routine that is as fast as MATLAB's.

