Selecting close matches from one array based on another reference array
Problem description
I have an array A and a reference array B. The size of A is at least as big as that of B. For example:
A = [2,100,300,793,1300,1500,1810,2400]
B = [4,305,789,1234,1890]
B is in fact the position of peaks in a signal at a specified time, and A contains the positions of peaks at a later time. But some of the elements in A are actually not the peaks I want (perhaps due to noise, etc.), and I want to find the 'real' ones in A based on B. The 'real' elements in A should be close to those in B; in the example given above, the 'real' ones in A should be A' = [2, 300, 793, 1300, 1810]. It should be obvious in this example that 100, 1500, 2400 are not the ones we want, as they are quite far off from any of the elements in B. How can I code this in the most efficient/accurate way in Python/MATLAB?
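For reference, the requirement can be stated as a small brute-force sketch: keep each element of A whose distance to its nearest element of B is below some threshold (the value 90 here is an assumption that happens to separate the sample data):

```python
import numpy as np

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])

# Keep elements of A whose distance to the nearest element of B
# is below a chosen threshold (90 is an assumption for this sample).
thresh = 90
nearest_dist = np.array([np.min(np.abs(B - a)) for a in A])
A_real = A[nearest_dist < thresh]
```

This loop is O(len(A) * len(B)); the answer below shows vectorized and sort-based ways to do the same thing.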
Recommended answer
Approach #1: With NumPy broadcasting, we can look at the absolute element-wise differences between the input arrays and use an appropriate threshold to filter out the unwanted elements from A. For the given sample inputs, a threshold of 90 seems to work.
Thus, we would have an implementation, like so -
thresh = 90
Aout = A[(np.abs(A[:,None] - B) < thresh).any(1)]
Sample run -
In [69]: A
Out[69]: array([ 2, 100, 300, 793, 1300, 1500, 1810, 2400])
In [70]: B
Out[70]: array([ 4, 305, 789, 1234, 1890])
In [71]: A[(np.abs(A[:,None] - B) < 90).any(1)]
Out[71]: array([ 2, 300, 793, 1300, 1810])
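Note that the broadcasted subtraction materializes a full len(A)-by-len(B) intermediate array, so memory grows with the product of the two sizes. A quick sketch of the cost (the sizes here are just illustrative):

```python
import numpy as np

# The broadcasted subtraction builds a full len(A)-by-len(B)
# intermediate, so memory scales with the product of the sizes.
A = np.random.randint(0, 100000, 1000)
B = np.random.randint(0, 100000, 400)

diff = np.abs(A[:, None] - B)   # shape (1000, 400): 400k temporaries
print(diff.shape)
```

For arrays in the tens of thousands of elements, this intermediate can reach hundreds of megabytes, which motivates the memory-efficient approach that follows.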
Approach #2: Based on this post, here's a memory-efficient approach using np.searchsorted, which could be crucial for large arrays -
import numpy as np

def searchsorted_filter(a, b, thresh):
choices = np.sort(b) # if b is already sorted, skip it
lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size-1)
ridx = (np.searchsorted(choices, a, 'right')-1).clip(min=0)
cl = np.take(choices,lidx) # Or choices[lidx]
cr = np.take(choices,ridx) # Or choices[ridx]
return a[np.minimum(np.abs(a - cl), np.abs(a - cr)) < thresh]
Sample run -
In [95]: searchsorted_filter(A,B, thresh = 90)
Out[95]: array([ 2, 300, 793, 1300, 1810])
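To see why the two searchsorted calls recover the nearest neighbor, here is the mechanics spelled out on the sample arrays (a sketch; B is already sorted here):

```python
import numpy as np

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])  # already sorted

# 'left' gives the index of the first element of B that is >= a,
# i.e. the nearest candidate from above; 'right' - 1 gives the
# nearest candidate from below. clip() keeps the indices in bounds.
lidx = np.searchsorted(B, A, 'left').clip(max=B.size - 1)
ridx = (np.searchsorted(B, A, 'right') - 1).clip(min=0)

# Distance to the nearer of the two candidates, then threshold:
dist = np.minimum(np.abs(A - B[lidx]), np.abs(A - B[ridx]))
A_real = A[dist < 90]
```

Each element of A is compared against at most two elements of B, so no len(A)-by-len(B) intermediate is ever built.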
Runtime test
In [104]: A = np.sort(np.random.randint(0,100000,(1000)))
In [105]: B = np.sort(np.random.randint(0,100000,(400)))
In [106]: out1 = A[(np.abs(A[:,None] - B) < 10).any(1)]
In [107]: out2 = searchsorted_filter(A,B, thresh = 10)
In [108]: np.allclose(out1, out2) # Verify results
Out[108]: True
In [109]: %timeit A[(np.abs(A[:,None] - B) < 10).any(1)]
100 loops, best of 3: 2.74 ms per loop
In [110]: %timeit searchsorted_filter(A,B, thresh = 10)
10000 loops, best of 3: 85.3 µs per loop
January 2018 update with further performance boost
We can avoid the second usage of np.searchsorted(..., 'right') by making use of the indices obtained from np.searchsorted(..., 'left') and also reuse the absolute computations, like so -
def searchsorted_filter_v2(a, b, thresh):
N = len(b)
choices = np.sort(b) # if b is already sorted, skip it
l = np.searchsorted(choices, a, 'left')
l_invalid_mask = l==N
l[l_invalid_mask] = N-1
left_offset = choices[l]-a
left_offset[l_invalid_mask] *= -1
r = (l - (left_offset!=0))
r_invalid_mask = r<0
r[r_invalid_mask] = 0
r += l_invalid_mask
right_offset = a-choices[r]
right_offset[r_invalid_mask] *= -1
out = a[(left_offset < thresh) | (right_offset < thresh)]
return out
Updated timings to test the further speedup -
In [388]: np.random.seed(0)
...: A = np.random.randint(0,1000000,(100000))
...: B = np.unique(np.random.randint(0,1000000,(40000)))
...: np.random.shuffle(B)
...: thresh = 10
...:
...: out1 = searchsorted_filter(A, B, thresh)
...: out2 = searchsorted_filter_v2(A, B, thresh)
...: print np.allclose(out1, out2)
True
In [389]: %timeit searchsorted_filter(A, B, thresh)
10 loops, best of 3: 24.2 ms per loop
In [390]: %timeit searchsorted_filter_v2(A, B, thresh)
100 loops, best of 3: 13.9 ms per loop
Digging deeper -
In [396]: a = A; b = B
In [397]: N = len(b)
...:
...: choices = np.sort(b) # if b is already sorted, skip it
...:
...: l = np.searchsorted(choices, a, 'left')
In [398]: %timeit np.sort(B)
100 loops, best of 3: 2 ms per loop
In [399]: %timeit np.searchsorted(choices, a, 'left')
100 loops, best of 3: 10.3 ms per loop
It seems like searchsorted and sort are taking almost all of the runtime, and they seem essential to this method. So, it doesn't look like this could be improved any further while staying with this sort-based approach.