Performance of various numpy fancy indexing methods, also with numba


Question


Since for my program fast indexing of Numpy arrays is quite necessary and fancy indexing doesn't have a good reputation considering performance, I decided to make a few tests. Especially since Numba is developing quite fast, I tried which methods work well with numba.


As inputs I've been using the following arrays for my small-arrays-test:

import numpy as np
import numba as nb

x = np.arange(0, 100, dtype=np.float64)  # array to be indexed
idx = np.array((0, 4, 55, -1), dtype=np.int32)  # fancy indexing array
bool_mask = np.zeros(x.shape, dtype=bool)  # boolean indexing mask (np.bool is removed in modern NumPy)
bool_mask[idx] = True  # set the same elements as in idx to True
y = np.zeros(idx.shape, dtype=np.float64)  # output array
y_bool = np.zeros(np.count_nonzero(bool_mask), dtype=np.float64)  # bool output array (only for convenience)


And the following arrays for my large-arrays-test (y_bool needed here to cope with dupe numbers from randint):

x = np.arange(0, 1000000, dtype=np.float64)
idx = np.random.randint(0, 1000000, size=int(1000000/50))
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[idx] = True
y = np.zeros(idx.shape, dtype=np.float64)
y_bool = np.zeros(np.count_nonzero(bool_mask), dtype=np.float64)
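The %timeit lines that follow are IPython magics; to reproduce the measurements from a plain Python script, here is a minimal sketch using the standard timeit module (absolute numbers will of course vary per machine):

```python
import timeit

import numpy as np

# Same setup as the large-arrays test above.
x = np.arange(0, 1000000, dtype=np.float64)
idx = np.random.randint(0, 1000000, size=int(1000000 / 50))
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[idx] = True

env = {"x": x, "idx": idx, "bool_mask": bool_mask}
for stmt in ("x[idx]", "x[bool_mask]", "x.take(idx)"):
    # best-of-5, 100 executions each, similar to %timeit's "best of" reporting
    best = min(timeit.repeat(stmt, globals=env, number=100, repeat=5)) / 100
    print(f"{stmt:15s} {best * 1e6:8.1f} µs per loop")
```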


This yields the following timings without using numba:

%timeit x[idx]
#1.08 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#large arrays: 129 µs ± 3.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit x[bool_mask]
#482 ns ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#large arrays: 621 µs ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.take(x, idx)
#2.27 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 112 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.take(x, idx, out=y)
#2.65 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 134 µs ± 4.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit x.take(idx)
#919 ns ± 21.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 108 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit x.take(idx, out=y)
#1.79 µs ± 40.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 131 µs ± 2.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.compress(bool_mask, x)
#1.93 µs ± 95.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 618 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.compress(bool_mask, x, out=y_bool)
#2.58 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 637 µs ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit x.compress(bool_mask)
#900 ns ± 82.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 628 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit x.compress(bool_mask, out=y_bool)
#1.78 µs ± 59.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 628 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.extract(bool_mask, x)
#5.29 µs ± 194 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# large arrays: 641 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


And with numba, using jit in nopython mode with caching and nogil, I decorated the indexing methods that are supported by numba:

@nb.jit(nopython=True, cache=True, nogil=True)
def fancy(x, idx):
    return x[idx]  # return the result so the indexing cannot be optimized away

@nb.jit(nopython=True, cache=True, nogil=True)
def fancy_bool(x, bool_mask):
    return x[bool_mask]

@nb.jit(nopython=True, cache=True, nogil=True)
def taker(x, idx):
    return np.take(x, idx)

@nb.jit(nopython=True, cache=True, nogil=True)
def ndtaker(x, idx):
    return x.take(idx)


This yields the following results for small and large arrays:

%timeit fancy(x, idx)
#686 ns ± 25.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 84.7 µs ± 1.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit fancy_bool(x, bool_mask)
#845 ns ± 31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 843 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit taker(x, idx)
#814 ns ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 87 µs ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit ndtaker(x, idx)
#831 ns ± 24.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# large arrays: 85.4 µs ± 2.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Summary


While for numpy without numba it is clear that small arrays are by far best indexed with boolean masks (about a factor of 2 compared to ndarray.take(idx)), for larger arrays ndarray.take(idx) performs best, in this case around 6 times faster than boolean indexing. The breakeven point is at an array size of around 1000 cells with an index-array size of around 20 cells.
For arrays with 1e5 elements and an index-array size of 5e3, ndarray.take(idx) will be around 10 times faster than boolean mask indexing. So boolean indexing seems to slow down considerably with array size, but to catch up a little after some array-size threshold is reached.
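The breakeven claim can be checked empirically with a small size sweep. This is only a sketch (fixed ~2% index density, best-of-3 timings), and the exact crossover point depends on the hardware:

```python
import timeit

import numpy as np

for n in (100, 1000, 10000, 100000):
    x = np.arange(n, dtype=np.float64)
    idx = np.random.randint(0, n, size=max(4, n // 50))
    bool_mask = np.zeros(n, dtype=bool)
    bool_mask[idx] = True
    env = {"x": x, "idx": idx, "bool_mask": bool_mask}
    t_take = min(timeit.repeat("x.take(idx)", globals=env, number=1000, repeat=3))
    t_bool = min(timeit.repeat("x[bool_mask]", globals=env, number=1000, repeat=3))
    print(f"n={n:>6}: take {t_take * 1e3:7.3f} ms  bool {t_bool * 1e3:7.3f} ms")
```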


For the numba-jitted functions there is a small speedup for all indexing methods except boolean mask indexing. Simple fancy indexing works best here, but is still slower than boolean masking without jitting.
For larger arrays boolean mask indexing is a lot slower than the other methods, and even slower than the non-jitted version. The other three methods all perform quite well, around 15% faster than the non-jitted version.


For my case with many arrays of different sizes, fancy indexing with numba is the best way to go. Perhaps some other people can also find some useful information in this quite lengthy post.




I'm sorry that I forgot to ask my question, which I in fact have. I was just rapidly typing this at the end of my workday and completely forgot it... Well, do you know any better and faster method than those I tested? Using Cython, my timings were between Numba and pure Python.
As the index array is predefined once and used without alteration over long iterations, any way of pre-defining the indexing process would be great. For this I thought about using strides, but I wasn't able to pre-define a custom set of strides. Is it possible to get a predefined view into the memory using strides?


I think I'll move my question about a predefined constant index array, which is applied a few million times to the same array of values (where only the values change, not the shape), to a new and more specific question. This question was too general, and perhaps I also phrased it a bit misleadingly. I'll post the link here as soon as the new question is open!
Here is the link to the follow-up question.

Answer


Your summary isn't completely correct: you already did tests with differently sized arrays, but one thing you didn't do was change the number of indexed elements.


I restricted it to plain indexing and omitted take (which is effectively integer array indexing) as well as compress and extract (because these are effectively boolean array indexing). The only difference for these is the constant factor: the constant factor for the methods take and compress will be smaller than the overhead of the numpy functions np.take and np.compress, but otherwise the effects will be negligible for reasonably sized arrays.
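That the omitted functions really select the same elements as plain indexing (only the constant factors differ) is easy to verify:

```python
import numpy as np

x = np.arange(0, 100, dtype=np.float64)
idx = np.array((0, 4, 55, -1), dtype=np.int32)
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[idx] = True

# take is integer (fancy) indexing; compress/extract are boolean indexing.
print(np.array_equal(x[idx], np.take(x, idx)))                  # True
print(np.array_equal(x[bool_mask], np.compress(bool_mask, x)))  # True
print(np.array_equal(x[bool_mask], np.extract(bool_mask, x)))   # True
```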


Just let me present it with different numbers:

# ~ every 500th element
x = np.arange(0, 1000000, dtype=np.float64)
idx = np.random.randint(0, 1000000, size=int(1000000/500))  # changed the ratio!
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[idx] = True

%timeit x[idx]
# 51.6 µs ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit x[bool_mask]
# 1.03 ms ± 37.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# ~ every 50th element
idx = np.random.randint(0, 1000000, size=int(1000000/50))  # changed the ratio!
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[idx] = True

%timeit x[idx]
# 1.46 ms ± 55.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x[bool_mask]
# 2.69 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# ~ every 5th element
idx = np.random.randint(0, 1000000, size=int(1000000/5))  # changed the ratio!
bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[idx] = True

%timeit x[idx]
# 14.9 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit x[bool_mask]
# 8.31 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


So what happened here? It's simple: integer array indexing only needs to access as many elements as there are values in the index array. That means it is quite fast if there are few matches but slow if there are many indices. Boolean array indexing, however, always needs to walk through the whole boolean array and check for True values. That means its runtime should be roughly "constant" for a given array size.


But wait, it's not really constant for boolean arrays, and why does integer array indexing take longer (last case) than boolean array indexing, even though it has to process roughly 5 times fewer elements?


That's where it gets more complicated. In this case the boolean array has True at random places, which means it is subject to branch-prediction failures. These are more likely if True and False occur equally often but at random places. That's why the boolean array indexing got slower: the ratio of True to False became more equal and thus more "random". The result array is also larger if there are more True values, which consumes more time as well.


As an example of this branch-prediction effect, consider the following (results may differ between systems/compilers):

bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[:1000000//2] = True   # first half True, second half False
%timeit x[bool_mask]
# 5.92 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[::2] = True   # True and False alternating
%timeit x[bool_mask]
# 16.6 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

bool_mask = np.zeros(x.shape, dtype=bool)
bool_mask[::2] = True
np.random.shuffle(bool_mask)  # shuffled
%timeit x[bool_mask]
# 18.2 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


So the distribution of True and False will critically affect the runtime with boolean masks, even if they contain the same number of True values! The same effect will be visible for the compress functions.


For integer array indexing (and likewise np.take) another effect becomes visible: cache locality. The indices in your case are randomly distributed, so your computer has to do a lot of "RAM" to "processor cache" loads, because it is very unlikely that two consecutive indices are near each other.

Compare this:

idx = np.random.randint(0, 1000000, size=int(1000000/5))
%timeit x[idx]
# 15.6 ms ± 703 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

idx = np.random.randint(0, 1000000, size=int(1000000/5))
idx = np.sort(idx)  # sort them
%timeit x[idx]
# 4.33 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


By sorting the indices, the chance that the next value is already in the cache increases immensely, and this can lead to huge speedups. That's a very important factor if you know the indices will be sorted (for example, if they were created by np.where they are sorted, which makes the result of np.where especially efficient for indexing).
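That np.where indeed delivers sorted (and therefore cache-friendly) indices can be confirmed directly; a small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000000)
bool_mask = rng.random(1000000) < 0.01  # ~1% of positions True
idx = np.where(bool_mask)[0]            # positions of the True entries

print(np.all(np.diff(idx) > 0))              # True: strictly ascending, no sort needed
print(np.array_equal(x[idx], x[bool_mask]))  # True: selects exactly the same elements
```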


So it's not that integer array indexing is slower for small arrays and faster for large ones; it depends on many more factors. Both have their use cases, and depending on the circumstances one may be (considerably) faster than the other.


Let me also talk a bit about the numba functions. First some general statements:

  • cache won't make a difference; it just avoids recompiling the function. In interactive environments this is essentially useless. It is faster if you package the functions in a module, though.
  • nogil by itself won't provide any speed boost. It is only faster if the function is called from different threads, because each function execution can release the GIL, so multiple calls can run in parallel.


Otherwise I don't know how numba effectively implements these functions. However, when you use NumPy features in numba it could be slower or faster; even if it's faster, it won't be much faster (except maybe for small arrays), because if it could be made faster, the NumPy developers would have implemented it as well. My rule of thumb is: if you can do it (vectorized) with NumPy, don't bother with numba. Only if you can't do it with vectorized NumPy functions, or NumPy would use too many temporary arrays, will numba shine!

