过滤NumPy数组:最佳方法是什么? [英] Filtering a NumPy Array: what is the best approach?

查看:91
本文介绍了过滤NumPy数组:最佳方法是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我有一个NumPy数组arr,我想按元素进行过滤,例如 我只想获取低于特定阈值k的值.

Suppose I have a NumPy array arr that I want to element-wise filter, e.g. I want to get only values below a certain threshold value k.

有两种方法,例如:

  1. 使用生成器:np.fromiter((x for x in arr if x < k), dtype=arr.dtype)
  2. 使用布尔蒙版切片:arr[arr < k]
  3. 使用np.where():arr[np.where(arr < k)]
  4. 使用np.nonzero():arr[np.nonzero(arr < k)]
  5. 使用基于Cython的自定义实现
  6. 使用基于Numba的自定义实现
  1. Using generators: np.fromiter((x for x in arr if x < k), dtype=arr.dtype)
  2. Using boolean mask slicing: arr[arr < k]
  3. Using np.where(): arr[np.where(arr < k)]
  4. Using np.nonzero(): arr[np.nonzero(arr < k)]
  5. Using a Cython-based custom implementation(s)
  6. Using a Numba-based custom implementation(s)

哪个最快?内存效率如何?

Which is the fastest? What about memory efficiency?

(基于@ShadowRanger注释添加了np.nonzero())

(EDITED: Added np.nonzero() based on @ShadowRanger comment)

推荐答案

定义

  1. 使用发电机:

def filter_fromiter(arr, k):
    return np.fromiter((x for x in arr if x < k), dtype=arr.dtype)

  1. 使用布尔蒙版切片:

def filter_mask(arr, k):
    return arr[arr < k]

  1. 使用np.where():

def filter_where(arr, k):
    return arr[np.where(arr < k)]

  1. 使用np.nonzero()

def filter_nonzero(arr, k):
    return arr[np.nonzero(arr < k)]

  1. 使用基于Cython的自定义实现:
    • 单次通过filter_cy()
    • 两次通过filter2_cy()
  1. Using a Cython-based custom implementation(s):
    • single-pass filter_cy()
    • two-passes filter2_cy()

%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True


cimport numpy as cnp
cimport cython as ccy

import numpy as np
import cython as cy


cdef long NUM = 1048576
cdef long MAX_VAL = 1048576
cdef long K = 1048576 // 2


cdef int smaller_than_cy(long x, long k=K):
    return x < k


cdef size_t _filter_cy(long[:] arr, long[:] result, size_t size, long k):
    cdef size_t j = 0
    for i in range(size):
        if smaller_than_cy(arr[i]):
            result[j] = arr[i]
            j += 1
    return j


cpdef filter_cy(arr, k):
    result = np.empty_like(arr)
    new_size = _filter_cy(arr, result, arr.size, k)
    return result[:new_size].copy()


cdef size_t _filtered_size(long[:] arr, size_t size, long k):
    cdef size_t j = 0
    for i in range(size):
        if smaller_than_cy(arr[i]):
            j += 1
    return j


cpdef filter2_cy(arr, k):
    cdef size_t new_size = _filtered_size(arr, arr.size, k)
    result = np.empty(new_size, dtype=arr.dtype)
    new_size = _filter_cy(arr, result, arr.size, k)
    return result

  1. 使用基于Numba的自定义实现
    • 单次通过filter_np_nb()
    • 两次通过filter2_np_nb()
  1. Using a Numba-based custom implementation
    • single-pass filter_np_nb()
    • two-passes filter2_np_nb()

import numba as nb


@nb.jit
def filter_func(x, k=K):
    return x < k


@nb.jit
def filter_np_nb(arr):
    result = np.empty_like(arr)
    j = 0
    for i in range(arr.size):
        if filter_func(arr[i]):
            result[j] = arr[i]
            j += 1
    return result[:j].copy()


@nb.jit
def filter2_np_nb(arr):
    j = 0
    for i in range(arr.size):
        if filter_func(arr[i]):
            j += 1
    result = np.empty(j, dtype=arr.dtype)
    j = 0
    for i in range(arr.size):
        if filter_func(arr[i]):
            result[j] = arr[i]
            j += 1
    return result


计时基准

基于生成器的filter_fromiter()方法比其他方法慢得多(降低了大约2个数量级,因此在图表中将其省略).


Timing Benchmarks

The generator-based filter_fromiter() method is much slower than the others (by approx. 2 orders of magnitude and it is therefore omitted in the charts).

时间将取决于输入数组的大小和已过滤项目的百分比.

The timing would depend on both the input array size and the percent of filtered items.

第一张图将时序作为输入大小的函数(对于约50%滤除的元素):

The first graph addresses the timings as a function of the input size (for ~50% filtered out elements):

通常,基于Numba的方法始终是最快的方法,紧随其后的是Cython方法.在其中,两次通过方法对于中型和大型输入最快.在NumPy中,基于np.where()和基于np.nonzero()的方法基本相同(除了很小的输入(对于np.nonzero()来说,它的速度似乎稍慢一些),它们都比布尔蒙版切片更快.对于很小的输入(低于100个元素),布尔掩码切片速度更快. 而且,对于很小的输入,基于Cython的解决方案要比基于NumPy的解决方案慢.

In general, the Numba based approach is consistently the fastest, closely followed by the Cython approach. Within them, the two-passes approaches are fastest for medium and larger inputs. Within NumPy, the np.where()-based and np.nonzero()-based approaches are basically the same (except for very small inputs for which np.nonzero() seems to be slightly slower), and they are both faster than the boolean mask slicing, except for very small inputs (below ~100 elements) where the boolean mask slicing is faster. Moreover, for very small inputs, the Cython based solution are slower than NumPy-based ones.

第二张图将时序作为通过过滤器的项的函数(固定输入大小为一百万个元素):

The second graph addresses the timings as a function of items passing through the filter (for a fixed input size of ~1 million elements):

第一个观察结果是,当达到〜50%填充量时,所有方法最慢,而填充量更少或更多时,它们则更快,并且朝着不填充量最快(滤出值的最高百分比,通过值的最低百分比)如图表的x轴所示). 同样,Numba和Cython版本通常都比基于NumPy的版本更快,其中Numba几乎总是最快,而Cython在图表的最右端胜过Numba. 值得注意的例外是,当填充率接近100%时,单遍Numba/Cython版本基本上会被复制大约.两次,布尔型蒙版切片解决方案最终跑赢了它们. 对于较大的填充距离,两遍方法具有越来越高的边际速度增益. 在NumPy中,基于np.where()和基于np.nonzero()的方法再次基本相同. 在比较基于NumPy的解决方案时,np.where()/np.nonzero()解决方案几乎总是优于布尔蒙版切片,除了图的最右端,布尔蒙版切片最快.

The first observation is that all methods are slowest when approaching a ~50% filling and with less, or more filling they are faster, and fastest towards no filling (highest percent of filtered-out values, lowest percent of passing through values as indicated in the x-axis of the graph). Again, both Numba and Cython version are typically faster than the NumPy-based counterparts, with Numba being fastest almost always and Cython winning over Numba for the outermost right part of the graph. The notable exception to this is when the filling is close to 100%, when single-pass Numba/Cython versions gets basically copied approx. twice and the boolean mask slicing solution eventually outperforms them. The two-passes approaches have increasing marginal speed gains for larger filling vaules. Within NumPy, the np.where()-based and np.nonzero()-based approaches are again basically the same. When comparing NumPy-based solution, the np.where()/np.nonzero() solutions outperform the boolean mask slicing almost always, except for the outermost right part of the graph, where boolean mask slicing becomes the fastest.

(完整代码可用此处)

基于生成器的filter_fromiter()方法仅需要最少的临时存储,而与输入的大小无关. 在内存方面,这是最有效的方法. Cython/Numba两遍方法具有类似的内存效率,因为输出大小是在第一遍过程中确定的.

The generator-based filter_fromiter() method requires only minimal temporary storage, independently of the size of the input. Memory-wise this is the most efficient method. Of similar memory efficiency are the Cython / Numba two-passes methods, because the size of the output is determined during the first pass.

在存储器方面,Cython和Numba的单通解决方案都需要一个临时的输入大小数组. 因此,这些是内存效率最低的方法.

On the memory side, the single-pass solutions for both Cython and Numba require a temporary array of the size of the input. Hence, these are the least memory-efficient methods.

布尔型掩码切片解决方案需要一个输入大小但类型为bool的临时数组,该数组在NumPy中为1位,因此它比典型值上NumPy数组的默认大小小约64倍. 64位系统.

The boolean mask slicing solution requires a temporary array of the size of the input but of type bool, which in NumPy is 1 bit, so this is ~64 times smaller than the default size of a NumPy array on a typical 64-bit system.

基于np.where()的解决方案与第一步(在np.where()内部)的布尔掩码切片具有相同的要求,该布尔掩码切片被转换为一系列int s(通常在64-but上为int64)第二步(np.where()的输出).因此,第二步具有可变的内存要求,具体取决于已过滤元素的数量.

The np.where() based solution has the same requirement as the boolean mask slicing in the first step (inside np.where()), which gets converted to a series of ints (typically int64 on a 64-but system) in the second step (the output of np.where()). This second step, therefore, has variable memory requirements, depending on the number of filtered elements.

  • 在指定不同的过滤条件时,generator方法也是最灵活的
  • Cython解决方案要求指定数据类型以使其快速
  • 对于Numba和Cython,
  • 可以将过滤条件指定为通用函数(因此不需要进行硬编码),但是必须在各自的环境中指定过滤条件,并且必须小心确保针对速度进行了适当的编译,或者观察到明显的变慢
  • 单遍解决方案在返回之前确实需要额外的.copy()以避免浪费内存
  • 由于
  • the generator method is also the most flexible when it comes to specifying a different filtering condition
  • the Cython solution requires specifying the data types for it to be fast
  • for both Numba and Cython, the filtering condition can be specified as a generic function (and therefore does not need to be hardcoded), but it must be specified within their respective environment, and care must be taken to make sure that this is properly compiled for speed, or substantial slowdowns are observed
  • the single-pass solutions DO require an extra .copy() right before return to avoid wasting memory
  • the NumPy methods do NOT return a view of the input, but a copy, as a result of advanced indexing:
arr = np.arange(100)
k = 50
print('`arr[arr > k]` is a copy: ', arr[arr > k].base is None)
# `arr[arr > k]` is a copy:  True
print('`arr[np.where(arr > k)]` is a copy: ', arr[np.where(arr > k)].base is None)
# `arr[np.where(arr > k)]` is a copy:  True
print('`arr[:k]` is a copy: ', arr[:k].base is None)
# `arr[:k]` is a copy:  False


(已在单次通过的Cython/Numba版本中包含基于np.nonzero()的解决方案和固定的内存泄漏,包括了两次通过的Cython/Numba版本-基于@ ShadowRanger,@ PaulPanzer和@ max9111注释.)


(EDITED: Included np.nonzero()-based solutions and fixed memory leaks in the single-pass Cython/Numba versions, included two-passes Cython/Numba versions -- based on @ShadowRanger, @PaulPanzer and @max9111 comments.)

这篇关于过滤NumPy数组:最佳方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆