Selecting close matches from one array based on another reference array
Question
I have an array A and a reference array B. The size of A is at least as big as that of B, e.g.
A = [2,100,300,793,1300,1500,1810,2400]
B = [4,305,789,1234,1890]
B is in fact the position of peaks in a signal at a specified time, and A contains the positions of peaks at a later time. But some of the elements in A are actually not the peaks I want (possibly due to noise, etc.), and I want to find the 'real' ones in A based on B. The 'real' elements in A should be close to those in B; in the example given above, the 'real' ones in A should be A' = [2, 300, 793, 1300, 1810]. It should be obvious in this example that 100, 1500, 2400 are not the ones we want, as they are quite far off from any of the elements in B. How can I code this in the most efficient/accurate way in Python/MATLAB?
Answer
Approach #1: With NumPy broadcasting, we can compute the absolute element-wise differences between the input arrays and use an appropriate threshold to filter out unwanted elements from A. For the given sample inputs, a threshold of 90 works. Thus, we would have an implementation like so -
thresh = 90
Aout = A[(np.abs(A[:,None] - B) < thresh).any(1)]
Sample run -
In [69]: A
Out[69]: array([ 2, 100, 300, 793, 1300, 1500, 1810, 2400])
In [70]: B
Out[70]: array([ 4, 305, 789, 1234, 1890])
In [71]: A[(np.abs(A[:,None] - B) < 90).any(1)]
Out[71]: array([ 2, 300, 793, 1300, 1810])
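For completeness, here is a self-contained sketch of the broadcasting approach, with the arrays and the threshold of 90 taken from the question (the import is added here so the snippet runs on its own):

```python
import numpy as np

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])
thresh = 90

# A[:, None] - B broadcasts to a (len(A), len(B)) matrix of pairwise
# differences; keep each element of A within thresh of some element of B.
keep = (np.abs(A[:, None] - B) < thresh).any(axis=1)
Aout = A[keep]
print(Aout)  # array([2, 300, 793, 1300, 1810])
```

Note that the intermediate matrix has len(A) * len(B) elements, which is what the next approach avoids.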
Approach #2: Based on this post, here's a memory-efficient approach using np.searchsorted, which could be crucial for large arrays -
def searchsorted_filter(a, b, thresh):
choices = np.sort(b) # if b is already sorted, skip it
lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size-1)
ridx = (np.searchsorted(choices, a, 'right')-1).clip(min=0)
cl = np.take(choices,lidx) # Or choices[lidx]
cr = np.take(choices,ridx) # Or choices[ridx]
return a[np.minimum(np.abs(a - cl), np.abs(a - cr)) < thresh]
Sample run -
In [95]: searchsorted_filter(A,B, thresh = 90)
Out[95]: array([ 2, 300, 793, 1300, 1810])
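To see why the two searchsorted calls in the function above are enough: in the sorted choices array, each element of a has at most two nearest candidates - the smallest candidate at or above it and the largest candidate at or below it. A small sketch of those per-element nearest distances on the question's data (variable names mirror the function above):

```python
import numpy as np

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])
choices = np.sort(B)

# Index of the smallest element of choices that is >= each a (clipped into range).
lidx = np.searchsorted(choices, A, 'left').clip(max=choices.size - 1)
# Index of the largest element of choices that is <= each a (clipped into range).
ridx = (np.searchsorted(choices, A, 'right') - 1).clip(min=0)

# Nearest-neighbour distance of each element of A to the set B.
dist = np.minimum(np.abs(A - choices[lidx]), np.abs(A - choices[ridx]))
print(dist)  # distances: [2, 96, 5, 4, 66, 266, 80, 510]
```

Comparing dist against thresh = 90 reproduces the mask from Approach #1, but without ever forming the len(A) x len(B) difference matrix.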
Runtime test
In [104]: A = np.sort(np.random.randint(0,100000,(1000)))
In [105]: B = np.sort(np.random.randint(0,100000,(400)))
In [106]: out1 = A[(np.abs(A[:,None] - B) < 10).any(1)]
In [107]: out2 = searchsorted_filter(A,B, thresh = 10)
In [108]: np.allclose(out1, out2) # Verify results
Out[108]: True
In [109]: %timeit A[(np.abs(A[:,None] - B) < 10).any(1)]
100 loops, best of 3: 2.74 ms per loop
In [110]: %timeit searchsorted_filter(A,B, thresh = 10)
10000 loops, best of 3: 85.3 µs per loop
January 2018 update with further performance boost
We can avoid the second call to np.searchsorted(..., 'right') by making use of the indices obtained from np.searchsorted(..., 'left') along with the absolute computations, like so -
def searchsorted_filter_v2(a, b, thresh):
    N = len(b)
    choices = np.sort(b)  # if b is already sorted, skip it
    # Index of the first element of choices that is >= each a
    l = np.searchsorted(choices, a, 'left')
    l_invalid_mask = l==N               # a is beyond the largest candidate
    l[l_invalid_mask] = N-1
    left_offset = choices[l]-a          # distance to the candidate at or above a
    left_offset[l_invalid_mask] *= -1   # make out-of-range offsets positive
    # Index of the last element of choices that is <= each a
    # (step back one unless we hit an exact match)
    r = (l - (left_offset!=0))
    r_invalid_mask = r<0                # a is below the smallest candidate
    r[r_invalid_mask] = 0
    r += l_invalid_mask
    right_offset = a-choices[r]         # distance to the candidate at or below a
    right_offset[r_invalid_mask] *= -1  # make out-of-range offsets positive
    out = a[(left_offset < thresh) | (right_offset < thresh)]
    return out
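As a quick sanity check (a sketch that restates the function above so the snippet runs on its own), the v2 result can be compared against the direct broadcasting output on the question's arrays:

```python
import numpy as np

def searchsorted_filter_v2(a, b, thresh):
    # Same body as the function above.
    N = len(b)
    choices = np.sort(b)
    l = np.searchsorted(choices, a, 'left')
    l_invalid_mask = l == N
    l[l_invalid_mask] = N - 1
    left_offset = choices[l] - a
    left_offset[l_invalid_mask] *= -1
    r = l - (left_offset != 0)
    r_invalid_mask = r < 0
    r[r_invalid_mask] = 0
    r += l_invalid_mask
    right_offset = a - choices[r]
    right_offset[r_invalid_mask] *= -1
    return a[(left_offset < thresh) | (right_offset < thresh)]

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])

out_v2 = searchsorted_filter_v2(A, B, thresh=90)
out_bc = A[(np.abs(A[:, None] - B) < 90).any(1)]
print(np.array_equal(out_v2, out_bc))  # True
```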
Updated timings to test the further speedup -
In [388]: np.random.seed(0)
...: A = np.random.randint(0,1000000,(100000))
...: B = np.unique(np.random.randint(0,1000000,(40000)))
...: np.random.shuffle(B)
...: thresh = 10
...:
...: out1 = searchsorted_filter(A, B, thresh)
...: out2 = searchsorted_filter_v2(A, B, thresh)
...: print np.allclose(out1, out2)
True
In [389]: %timeit searchsorted_filter(A, B, thresh)
10 loops, best of 3: 24.2 ms per loop
In [390]: %timeit searchsorted_filter_v2(A, B, thresh)
100 loops, best of 3: 13.9 ms per loop
Digging deeper -
In [396]: a = A; b = B
In [397]: N = len(b)
...:
...: choices = np.sort(b) # if b is already sorted, skip it
...:
...: l = np.searchsorted(choices, a, 'left')
In [398]: %timeit np.sort(B)
100 loops, best of 3: 2 ms per loop
In [399]: %timeit np.searchsorted(choices, a, 'left')
100 loops, best of 3: 10.3 ms per loop
It seems like searchsorted and sort take up almost all of the runtime, and they seem essential to this method. So it doesn't look like this could be improved any further while staying with this sort-based approach.