Efficiently return the index of the first value satisfying condition in array
I need to find the index of the first value in a 1-D NumPy array, or Pandas numeric series, that satisfies a condition. The array is large, and the index may be near the start or the end of the array, or the condition may not be met at all. I can't tell in advance which is more likely. If the condition is not met, the return value should be -1. I've considered a few approaches.
Attempt 1
# func(arr) returns a Boolean array
idx = next(iter(np.where(func(arr))[0]), -1)
But this is often too slow, as func(arr) applies a vectorised function to the entire array rather than stopping when the condition is met. Specifically, it is expensive when the condition is met near the start of the array.
Attempt 2
np.argmax is marginally faster, but fails to identify when a condition is never met:
np.random.seed(0)
arr = np.random.rand(10**7)
assert next(iter(np.where(arr > 0.999999)[0]), -1) == np.argmax(arr > 0.999999)
%timeit next(iter(np.where(arr > 0.999999)[0]), -1) # 21.2 ms
%timeit np.argmax(arr > 0.999999) # 17.7 ms
np.argmax(arr > 1.0) returns 0, i.e. an index at which the condition is not satisfied.
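The ambiguity of np.argmax can be removed by checking the mask at the position it returns. The following is a minimal sketch, not part of the original question; the helper name first_true_index is hypothetical. It still evaluates the full Boolean mask, so it addresses correctness only, not the speed concern.

```python
import numpy as np

def first_true_index(mask):
    """Return the index of the first True in a 1-D Boolean array, or -1."""
    # np.argmax raises on an empty array, so handle that case explicitly
    if mask.size == 0:
        return -1
    i = int(np.argmax(mask))
    # argmax returns 0 for an all-False mask, so confirm the hit is real
    return i if mask[i] else -1

arr = np.array([0.2, 0.5, 0.95, 0.99])
found = first_true_index(arr > 0.9)    # condition met at position 2
missing = first_true_index(arr > 1.0)  # condition never met
print(found, missing)
```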
Attempt 3
# func(val) returns a Boolean scalar
idx = next((idx for idx, val in enumerate(arr) if func(val)), -1)
But this is too slow when the condition is met near the end of the array. Presumably this is because the generator expression incurs an expensive overhead from a large number of __next__ calls.
Is this always a compromise, or is there a way, for a generic func, to extract the first index efficiently?
Benchmarking
For benchmarking, assume func tests whether a value is greater than a given constant:
# Python 3.6.5, NumPy 1.14.3, Numba 0.38.0
import numpy as np
np.random.seed(0)
arr = np.random.rand(10**7)
m = 0.9
n = 0.999999
# Start of array benchmark
%timeit next(iter(np.where(arr > m)[0]), -1) # 43.5 ms
%timeit next((idx for idx, val in enumerate(arr) if val > m), -1) # 2.5 µs
# End of array benchmark
%timeit next(iter(np.where(arr > n)[0]), -1) # 21.4 ms
%timeit next((idx for idx, val in enumerate(arr) if val > n), -1) # 39.2 ms
numba
With numba it's possible to optimise both scenarios. Syntactically, you need only construct a function with a simple for loop:
from numba import njit

@njit
def get_first_index_nb(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

idx = get_first_index_nb(arr, 0.9)
Numba improves performance by JIT ("Just In Time") compiling code and leveraging CPU-level optimisations. A regular for loop without the @njit decorator would typically be slower than the methods you've already tried for the case where the condition is met late.
For a Pandas numeric series df['data'], you can simply feed the NumPy representation to the JIT-compiled function:
idx = get_first_index_nb(df['data'].values, 0.9)
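Note that the value returned this way is a position in the underlying array, not an index label. The sketch below shows one way to map it back to a label; it is an illustration rather than part of the original answer. The function first_position is a hypothetical plain-Python stand-in for get_first_index_nb so that the example runs without Numba, and the sample series is made up.

```python
import numpy as np
import pandas as pd

def first_position(values, k):
    # Plain-Python stand-in for the JIT-compiled get_first_index_nb
    for i in range(len(values)):
        if values[i] > k:
            return i
    return -1

# Illustrative series with a non-default (string) index
s = pd.Series([0.1, 0.95, 0.3], index=["a", "b", "c"])

pos = first_position(s.to_numpy(), 0.9)       # positional result
label = s.index[pos] if pos != -1 else None   # corresponding index label
print(pos, label)
```

Here s.to_numpy() plays the same role as .values above; either yields the NumPy array that the loop iterates over.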
Generalisation
Since numba permits functions as arguments, and assuming the passed function can also be JIT-compiled, you can arrive at a method to calculate the nth index where a condition is met for an arbitrary func.
@njit
def get_nth_index_count(A, func, count):
    c = 0
    for i in range(len(A)):
        if func(A[i]):
            c += 1
            if c == count:
                return i
    return -1

@njit
def func(val):
    return val > 0.9

# get index of 3rd value where func evaluates to True
idx = get_nth_index_count(arr, func, 3)
For the 3rd last value, you can feed the reverse, arr[::-1], and subtract the result from len(arr) - 1, the - 1 being necessary to account for 0-indexing. A result of -1 (the condition is not met often enough) should be passed through unchanged.
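The index arithmetic for the reversed case can be sketched as follows. This is an illustration under stated assumptions: get_nth_last_index is a hypothetical wrapper name, and get_nth_index_count is repeated here without @njit so the example runs without Numba installed.

```python
import numpy as np

def get_nth_index_count(A, func, count):
    # Plain-Python copy of the function above (no @njit), for illustration
    c = 0
    for i in range(len(A)):
        if func(A[i]):
            c += 1
            if c == count:
                return i
    return -1

def get_nth_last_index(A, func, count):
    # Search the reversed view, then convert back to a forward index
    rev = get_nth_index_count(A[::-1], func, count)
    # Keep -1 as the "not found" sentinel instead of converting it
    return -1 if rev == -1 else len(A) - 1 - rev

arr = np.array([0.95, 0.1, 0.97, 0.2, 0.99, 0.91])
# Values above 0.9 sit at positions 0, 2, 4 and 5
idx = get_nth_last_index(arr, lambda v: v > 0.9, 3)
print(idx)
```

Since arr[::-1] is a view, reversing does not copy the data.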
Performance benchmarking
# Python 3.6.5, NumPy 1.14.3, Numba 0.38.0
np.random.seed(0)
arr = np.random.rand(10**7)
m = 0.9
n = 0.999999
@njit
def get_first_index_nb(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

def get_first_index_np(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

# Condition met near the start of the array (threshold m)
%timeit get_first_index_nb(arr, m) # 375 ns
%timeit get_first_index_np(arr, m) # 2.71 µs
%timeit next(iter(np.where(arr > m)[0]), -1) # 43.5 ms
%timeit next((idx for idx, val in enumerate(arr) if val > m), -1) # 2.5 µs

# Condition met near the end of the array (threshold n)
%timeit get_first_index_nb(arr, n) # 204 µs
%timeit get_first_index_np(arr, n) # 44.8 ms
%timeit next(iter(np.where(arr > n)[0]), -1) # 21.4 ms
%timeit next((idx for idx, val in enumerate(arr) if val > n), -1) # 39.2 ms