numba guvectorize target='parallel' slower than target='cpu'


Problem description

I've been attempting to optimize a piece of Python code that involves large multi-dimensional array calculations. I am getting counterintuitive results with numba. I am running on a mid-2015 MBP, 2.5 GHz quad-core i7, OS 10.10.5, Python 2.7.11. Consider the following:

 import numpy as np
 from numba import jit, vectorize, guvectorize
 import numexpr as ne
 import timeit

 def add_two_2ds_naive(A,B,res):
     for i in range(A.shape[0]):
         for j in range(B.shape[1]):
             res[i,j] = A[i,j]+B[i,j]

 @jit
 def add_two_2ds_jit(A,B,res):
     for i in range(A.shape[0]):
         for j in range(B.shape[1]):
             res[i,j] = A[i,j]+B[i,j]

 @guvectorize(['(float64[:,:],float64[:,:],float64[:,:])'],
    '(n,m),(n,m)->(n,m)', target='cpu')
 def add_two_2ds_cpu(A,B,res):
     for i in range(A.shape[0]):
         for j in range(B.shape[1]):
             res[i,j] = A[i,j]+B[i,j]

 @guvectorize(['(float64[:,:],float64[:,:],float64[:,:])'],
    '(n,m),(n,m)->(n,m)',target='parallel')
 def add_two_2ds_parallel(A,B,res):
     for i in range(A.shape[0]):
         for j in range(B.shape[1]):
             res[i,j] = A[i,j]+B[i,j]

 def add_two_2ds_numexpr(A,B,res):
     # use out= so numexpr writes into the preallocated array;
     # a plain res = ne.evaluate('A+B') would only rebind the local name
     ne.evaluate('A+B', out=res)

 if __name__=="__main__":
     np.random.seed(69)
     A = np.random.rand(10000,100)
     B = np.random.rand(10000,100)
     res = np.zeros((10000,100))

I can now run timeit on the various functions:

%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.16 ms per loop

%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.19 ms per loop

%timeit add_two_2ds_parallel(A,B,res)
100 loops, best of 3: 6.9 ms per loop

%timeit add_two_2ds_numexpr(A,B,res)
1000 loops, best of 3: 1.62 ms per loop
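
(Outside IPython, the timeit module already imported above can produce comparable numbers. A minimal sketch, with an illustrative loop count and a warm-up call so JIT compilation isn't timed:)

for name, fn in [('jit', add_two_2ds_jit), ('cpu', add_two_2ds_cpu),
                 ('parallel', add_two_2ds_parallel), ('numexpr', add_two_2ds_numexpr)]:
    fn(A, B, res)  # warm up: triggers compilation before timing
    t = timeit.timeit(lambda f=fn: f(A, B, res), number=100)
    print('%s: %.3f ms per loop' % (name, t / 100 * 1e3))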

It seems that 'parallel' is not even using the majority of a single core: in top, Python hits ~40% CPU for 'parallel', ~100% for 'cpu', and numexpr hits ~300%.

Recommended answer

There are two issues with your @guvectorize implementations. The first is that you are doing all the looping inside your @guvectorize kernel, so there is actually nothing for the Numba parallel target to parallelize. Both @vectorize and @guvectorize parallelize over the broadcast dimensions of a ufunc/gufunc. Since the signature of your gufunc is 2D and your inputs are 2D, there is only a single call to the inner function, which explains why you never saw more than 100% CPU usage.
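
For example (a minimal sketch of this point, not code from the original thread): if the kernel is declared over 1-D rows instead, the leading axis of the 2-D inputs becomes a broadcast dimension that target='parallel' can split across threads:

@guvectorize(['(float64[:],float64[:],float64[:])'],
   '(n),(n)->(n)', target='parallel')
def add_two_rows(a, b, res):
    # called with (10000, 100) arrays, the 10000 rows form the broadcast
    # dimension, so each row addition can run on a different thread
    for i in range(a.shape[0]):
        res[i] = a[i] + b[i]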

The best way to write the function you have above is to use a regular ufunc:

@vectorize('(float64, float64)', target='parallel')
def add_ufunc(a, b):
    # scalar kernel: Numba generates the loop over all broadcast elements
    # and, with target='parallel', splits that loop across threads
    return a + b
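
Called as add_ufunc(A, B, res), the third positional argument is the ufunc's standard out parameter, so the sum is written into the preallocated res rather than into a freshly allocated array.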

Then on my system, I see these speeds:

%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.87 ms per loop

%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.81 ms per loop

%timeit add_two_2ds_parallel(A,B,res)
The slowest run took 11.82 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 2.43 ms per loop

%timeit add_two_2ds_numexpr(A,B,res)
100 loops, best of 3: 2.79 ms per loop

%timeit add_ufunc(A, B, res)
The slowest run took 9.24 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 2.03 ms per loop

(This is a very similar OS X system to yours, but with OS X 10.11.)

Although Numba's parallel ufunc now beats numexpr (and I see add_ufunc using about 280% CPU), it doesn't beat the simple single-threaded CPU case. I suspect that the bottleneck is due to memory (or cache) bandwidth, but I haven't done the measurements to check that.
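
(A rough back-of-envelope check, mine rather than the answerer's: each call touches three 10000×100 float64 arrays, i.e. 3 × 8 MB = 24 MB, and at ~1.9 ms per call that is roughly 12-13 GB/s of memory traffic, already a sizable fraction of what a mid-2015 MBP's DDR3 memory can sustain.)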

Generally speaking, you will see much more benefit from the parallel ufunc target if you are doing more math operations per memory element (like, say, a cosine).
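
For instance, a sketch of such a compute-bound ufunc (heavy_ufunc is a hypothetical example, not from the thread): several transcendental operations per element shift the bottleneck from memory bandwidth to arithmetic, which is where target='parallel' pays off:

import math

@vectorize(['float64(float64, float64)'], target='parallel')
def heavy_ufunc(a, b):
    # math-heavy per-element work: compute-bound rather than memory-bound
    return math.cos(a) * math.sin(b) + math.sqrt(a * a + b * b)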
