根据另一个参考数组从一个数组中选择接近匹配 [英] Selecting close matches from one array based on another reference array

查看:58
本文介绍了根据另一个参考数组从一个数组中选择接近匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数组A和一个引用数组B. A的大小至少与B一样大.例如

I have an array A and a reference array B. Size of A is at least as big as B. e.g.

A = [2,100,300,793,1300,1500,1810,2400]
B = [4,305,789,1234,1890]

实际上,

B是信号在指定时间的峰值位置,而A包含稍后时间的峰值位置.但是A中的某些元素实际上不是我想要的峰(可能是由于噪声等引起的),我想根据BA中找到真实"峰. A中的'real'元素应该与B中的元素接近,并且在上面给出的示例中,A中的'real'元素应该为A'=[2,300,793,1300,1810].在此示例中应该显而易见,100,1500,2400不是我们想要的,因为它们与B中的任何元素都相距甚远.如何在python/matlab中以最有效/准确的方式对此进行编码?

B is in fact the position of peaks in a signal at a specified time, and A contains position of peaks at a later time. But some of the elements in A are actually not the peaks I want (might be due to noise, etc), and I want to find the 'real' one in A based on B. The 'real' elements in A should be close to those in B, and in the example given above, the 'real' ones in A should be A'=[2,300,793,1300,1810]. It should be obvious in this example that 100,1500,2400 are not the ones we want as they are quite far off from any of the elements in B. How can I code this in the most efficient/accurate way in python/matlab?

推荐答案

方法1:使用

Approach #1: With NumPy broadcasting, we can look for absolute element-wise subtractions between the input arrays and use an appropriate threshold to filter out unwanted elements from A. It seems for the given sample inputs, a threshold of 90 works.

这样,我们将有一个实现,像这样-

Thus, we would have an implementation, like so -

thresh = 90
Aout = A[(np.abs(A[:,None] - B) < thresh).any(1)]

样品运行-

In [69]: A
Out[69]: array([   2,  100,  300,  793, 1300, 1500, 1810, 2400])

In [70]: B
Out[70]: array([   4,  305,  789, 1234, 1890])

In [71]: A[(np.abs(A[:,None] - B) < 90).any(1)]
Out[71]: array([   2,  300,  793, 1300, 1810])

方法2::基于 this post ,这是一种使用内存的有效方法 np.searchsorted ,对于大型数组-

Approach #2: Based on this post, here's a memory efficient approach using np.searchsorted, which could be crucial for large arrays -

def searchsorted_filter(a, b, thresh):
    choices = np.sort(b) # if b is already sorted, skip it
    lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size-1)
    ridx = (np.searchsorted(choices, a, 'right')-1).clip(min=0)
    cl = np.take(choices,lidx) # Or choices[lidx]
    cr = np.take(choices,ridx) # Or choices[ridx]
    return a[np.minimum(np.abs(a - cl), np.abs(a - cr)) < thresh]

样品运行-

In [95]: searchsorted_filter(A,B, thresh = 90)
Out[95]: array([   2,  300,  793, 1300, 1810])


运行时测试

In [104]: A = np.sort(np.random.randint(0,100000,(1000)))

In [105]: B = np.sort(np.random.randint(0,100000,(400)))

In [106]: out1 = A[(np.abs(A[:,None] - B) < 10).any(1)]

In [107]: out2 = searchsorted_filter(A,B, thresh = 10)

In [108]: np.allclose(out1, out2)  # Verify results
Out[108]: True

In [109]: %timeit A[(np.abs(A[:,None] - B) < 10).any(1)]
100 loops, best of 3: 2.74 ms per loop

In [110]: %timeit searchsorted_filter(A,B, thresh = 10)
10000 loops, best of 3: 85.3 µs per loop


2018年1月更新,性能进一步提高

我们可以通过使用从np.searchsorted(..., 'left')获得的索引以及absolute计算来避免np.searchsorted(..., 'right')的第二次使用,就像这样-

We can avoid the second usage of np.searchsorted(..., 'right') by making use of the indices obtained from np.searchsorted(..., 'left') and also the absolute computations, like so -

def searchsorted_filter_v2(a, b, thresh):
    N = len(b)

    choices = np.sort(b) # if b is already sorted, skip it

    l = np.searchsorted(choices, a, 'left')
    l_invalid_mask = l==N
    l[l_invalid_mask] = N-1
    left_offset = choices[l]-a
    left_offset[l_invalid_mask] *= -1    

    r = (l - (left_offset!=0))
    r_invalid_mask = r<0
    r[r_invalid_mask] = 0
    r += l_invalid_mask
    right_offset = a-choices[r]
    right_offset[r_invalid_mask] *= -1

    out = a[(left_offset < thresh) | (right_offset < thresh)]
    return out

更新了计时以测试进一步的加速-

Updated timings to test the further speedup -

In [388]: np.random.seed(0)
     ...: A = np.random.randint(0,1000000,(100000))
     ...: B = np.unique(np.random.randint(0,1000000,(40000)))
     ...: np.random.shuffle(B)
     ...: thresh = 10
     ...: 
     ...: out1 = searchsorted_filter(A, B, thresh)
     ...: out2 = searchsorted_filter_v2(A, B, thresh)
     ...: print np.allclose(out1, out2)
True

In [389]: %timeit searchsorted_filter(A, B, thresh)
10 loops, best of 3: 24.2 ms per loop

In [390]: %timeit searchsorted_filter_v2(A, B, thresh)
100 loops, best of 3: 13.9 ms per loop

深入挖掘-

In [396]: a = A; b = B

In [397]: N = len(b)
     ...: 
     ...: choices = np.sort(b) # if b is already sorted, skip it
     ...: 
     ...: l = np.searchsorted(choices, a, 'left')

In [398]: %timeit np.sort(B)
100 loops, best of 3: 2 ms per loop

In [399]: %timeit np.searchsorted(choices, a, 'left')
100 loops, best of 3: 10.3 ms per loop

似乎像searchsortedsort几乎占用了所有运行时,它们似乎是此方法必不可少的.因此,使用这种基于排序的方法似乎无法进一步改善它.

Seems like searchsorted and sort are taking almost all of the runtime and they seem essential to this method. So, doesn't seem like it could be improved any further staying with this sort-based approach.

这篇关于根据另一个参考数组从一个数组中选择接近匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆