Why is this numba code 6x slower than numpy code?


Question

Is there any reason why the following code runs in 2s,

import numpy as np

def euclidean_distance_square(x1, x2):
    return -2*np.dot(x1, x2.T) + np.expand_dims(np.sum(np.square(x1), axis=1), axis=1) + np.sum(np.square(x2), axis=1)

while the following numba code runs in 12s?

from numba import jit

@jit(nopython=True)
def euclidean_distance_square(x1, x2):
    return -2*np.dot(x1, x2.T) + np.expand_dims(np.sum(np.square(x1), axis=1), axis=1) + np.sum(np.square(x2), axis=1)

My x1 is a matrix of dimension (1, 512) and x2 is a matrix of dimension (3000000, 512). It is quite weird that numba can be so much slower. Am I using it wrong?

I really need to speed this up because I need to run this function 3 million times and 2s is still way too slow.

I need to run this on the CPU because, as you can see, x2 is so large that it cannot be loaded onto a GPU (or at least not onto my GPU): there is not enough memory.

Recommended answer

It is quite weird that numba can be so much slower.

It's not too weird. When you call NumPy functions inside a numba function, you call numba's own versions of those functions. These can be faster, slower, or just as fast as the NumPy versions. You might be lucky or unlucky (you were unlucky!). But even inside the numba function you still create lots of temporaries because you use the NumPy functions (one temporary array for the dot result, one for each square and each sum, one for the dot plus the first sum), so you don't take advantage of what numba makes possible.
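To make the temporaries visible, the one-line expression from the question can be split into one statement per intermediate result. This is only an illustrative sketch: the function name `euclidean_distance_square_steps` and the variable names are made up here, and the small input arrays are the ones used in the correctness test of this answer, truncated to three rows.

```python
import numpy as np

def euclidean_distance_square_steps(x1, x2):
    # Same arithmetic as the question's one-liner, one intermediate per line.
    # All names are illustrative; they do not appear in the original code.
    dot = np.dot(x1, x2.T)               # new array, shape (1, n)
    scaled = -2 * dot                    # new array, shape (1, n)
    sq1 = np.square(x1)                  # new array, shape (1, d)
    sum1 = np.sum(sq1, axis=1)           # new array, shape (1,)
    col1 = np.expand_dims(sum1, axis=1)  # shape (1, 1); in NumPy this is a view
    sq2 = np.square(x2)                  # new array, shape (n, d): as big as x2
    sum2 = np.sum(sq2, axis=1)           # new array, shape (n,)
    partial = scaled + col1              # new array, shape (1, n)
    return partial + sum2                # new array, shape (1, n)

x1 = np.array([[1., 2, 3]])
x2 = np.array([[1., 2, 3], [2, 3, 4], [3, 4, 5]])
result = euclidean_distance_square_steps(x1, x2)
```

The `sq2` line is the expensive one for a (3000000, 512) input, because it materializes a second array of the same size as x2 just to sum it afterwards.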

Am I using it wrong?

Essentially: yes.

I really need to speed this up

Okay, I'll give it a try.

Let's start by unrolling the sum-of-squares calls along axis 1:

import numba as nb
import numpy as np

@nb.njit
def sum_squares_2d_array_along_axis1(arr):
    res = np.empty(arr.shape[0], dtype=arr.dtype)
    for o_idx in range(arr.shape[0]):
        sum_ = 0
        for i_idx in range(arr.shape[1]):
            sum_ += arr[o_idx, i_idx] * arr[o_idx, i_idx]
        res[o_idx] = sum_
    return res


@nb.njit
def euclidean_distance_square_numba_v1(x1, x2):
    return -2 * np.dot(x1, x2.T) + np.expand_dims(sum_squares_2d_array_along_axis1(x1), axis=1) + sum_squares_2d_array_along_axis1(x2)

On my computer that's already 2 times faster than the NumPy code and almost 10 times faster than your original Numba code.

Speaking from experience, getting it 2x faster than NumPy is generally the limit (at least if the NumPy version isn't needlessly complicated or inefficient). However, you can squeeze out a bit more by unrolling everything:

import numba as nb

@nb.njit
def euclidean_distance_square_numba_v2(x1, x2):
    f1 = 0.
    for i_idx in range(x1.shape[1]):
        f1 += x1[0, i_idx] * x1[0, i_idx]

    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0
        for i_idx in range(x2.shape[1]):
            val_from_x2 = x2[o_idx, i_idx]
            val += (-2) * x1[0, i_idx] * val_from_x2 + val_from_x2 * val_from_x2
        val += f1
        res[o_idx] = val
    return res

But that only gives a ~10-20% improvement over the previous approach.

At that point you might realize that you can simplify the code (even though it probably won't speed it up):

import numba as nb

@nb.njit
def euclidean_distance_square_numba_v3(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

Yeah, that looks pretty straightforward and it's not really slower.
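The reason the v3 loop computes the same values as v2 is the usual expansion of a squared difference. Written out for the single row of x1 and row o of x2 (the index notation below is introduced here for illustration and is not part of the original answer):

```latex
\sum_{i} \bigl(x_{1,i} - x_{2,i}^{(o)}\bigr)^2
  = \underbrace{\sum_{i} x_{1,i}^2}_{\texttt{f1}}
  + \underbrace{\sum_{i} \Bigl(-2\, x_{1,i}\, x_{2,i}^{(o)} + \bigl(x_{2,i}^{(o)}\bigr)^2\Bigr)}_{\text{inner loop of v2}}
```

The left-hand side is what v3 accumulates in `val`; the right-hand side is what v2 computes as `f1` plus its inner-loop sum.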

However, in all the excitement I forgot to mention the obvious solution: scipy.spatial.distance.cdist, which has a sqeuclidean (squared Euclidean distance) option:

from scipy.spatial import distance
distance.cdist(x1, x2, metric='sqeuclidean')

It's not really faster than numba, but it's available without having to write your own function...

Test for correctness and do the warmups:

x1 = np.array([[1.,2,3]])
x2 = np.array([[1.,2,3], [2,3,4], [3,4,5], [4,5,6], [5,6,7]])

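res1 = euclidean_distance_square(x1, x2)
# euclidean_distance_square_numba_original is the @jit version from the question,
# renamed so that it can coexist with the plain NumPy function.
res2 = euclidean_distance_square_numba_original(x1, x2)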
res3 = euclidean_distance_square_numba_v1(x1, x2)
res4 = euclidean_distance_square_numba_v2(x1, x2)
res5 = euclidean_distance_square_numba_v3(x1, x2)
np.testing.assert_array_equal(res1, res2)
np.testing.assert_array_equal(res1, res3)
np.testing.assert_array_equal(res1[0], res4)
np.testing.assert_array_equal(res1[0], res5)
np.testing.assert_almost_equal(res1, distance.cdist(x1, x2, metric='sqeuclidean'))
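The exact-equality checks above are run on small, integer-valued inputs. With random floating-point data, the expanded formula and the direct squared-difference formula may differ in the last few bits, so a tolerance-based comparison is the safer choice. A minimal NumPy-only sketch, where the seed, array sizes, and tolerances are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x1 = rng.random((1, 8))
x2 = rng.random((5, 8))

# Expanded form, as in the NumPy version from the question.
expanded = (-2 * np.dot(x1, x2.T)
            + np.expand_dims(np.sum(np.square(x1), axis=1), axis=1)
            + np.sum(np.square(x2), axis=1))

# Direct form, as in the v3 loop: sum of squared differences per row of x2.
direct = np.sum(np.square(x1 - x2), axis=1)

# Compare with a tolerance instead of requiring bit-for-bit equality.
np.testing.assert_allclose(expanded[0], direct, rtol=1e-9, atol=1e-12)
```

The same `assert_allclose` call could replace the `assert_array_equal` checks when validating the numba variants on realistic data.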

Timings:

x1 = np.random.random((1, 512))
x2 = np.random.random((1000000, 512))

%timeit euclidean_distance_square(x1, x2)
# 2.09 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_original(x1, x2)
# 10.9 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v1(x1, x2)
# 907 ms ± 7.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v2(x1, x2)
# 715 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v3(x1, x2)
# 731 ms ± 34.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit distance.cdist(x1, x2, metric='sqeuclidean')
# 706 ms ± 4.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note: If you have arrays of integers you might want to change the hard-coded 0.0 in the numba functions to 0.
