Selecting close matches from one array based on another reference array

Problem Description

I have an array A and a reference array B. The size of A is at least as big as that of B, e.g.

A = [2,100,300,793,1300,1500,1810,2400]
B = [4,305,789,1234,1890]

B is in fact the position of peaks in a signal at a specified time, and A contains the positions of peaks at a later time. But some of the elements in A are actually not the peaks I want (possibly due to noise, etc.), and I want to find the 'real' ones in A based on B. The 'real' elements in A should be close to those in B; in the example given above, the 'real' ones in A should be A'=[2,300,793,1300,1810]. It should be obvious in this example that 100, 1500, 2400 are not the ones we want, as they are quite far off from any of the elements in B. How can I code this in the most efficient/accurate way in python/matlab?

Recommended Answer

Approach #1: With NumPy broadcasting, we can compute the absolute element-wise differences between the input arrays and use an appropriate threshold to filter out unwanted elements from A. For the given sample inputs, a threshold of 90 seems to work.
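
The snippets here assume A and B are NumPy arrays rather than Python lists; a minimal setup for the sample inputs (assuming NumPy is installed) would be -

import numpy as np

# Sample data from the question, as NumPy arrays
A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])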

Thus, we would have an implementation like so -

thresh = 90
# Keep elements of A that lie within `thresh` of at least one element of B
Aout = A[(np.abs(A[:,None] - B) < thresh).any(1)]

Sample run -

In [69]: A
Out[69]: array([   2,  100,  300,  793, 1300, 1500, 1810, 2400])

In [70]: B
Out[70]: array([   4,  305,  789, 1234, 1890])

In [71]: A[(np.abs(A[:,None] - B) < 90).any(1)]
Out[71]: array([   2,  300,  793, 1300, 1810])
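
The right value of thresh is data-dependent. As a hedged sanity check (a heuristic, not part of the original answer), one can keep thresh below half the smallest gap between consecutive peaks in B, so that no element of A can fall within thresh of two different reference peaks -

import numpy as np

B = np.array([4, 305, 789, 1234, 1890])
# Half the smallest spacing between consecutive reference peaks:
# a heuristic upper bound for thresh (assumption, not from the original answer)
max_safe_thresh = np.diff(np.sort(B)).min() / 2.0
print(max_safe_thresh)   # 150.5 for the sample B, so thresh = 90 is safe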

Approach #2: Based on this post, here's a memory-efficient approach using np.searchsorted, which could be crucial for large arrays -

def searchsorted_filter(a, b, thresh):
    # For each element of a, find its nearest neighbours in b on either side
    # via binary search, instead of building a full len(a) x len(b) distance
    # matrix as in Approach #1.
    choices = np.sort(b) # if b is already sorted, skip it
    lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size-1)
    ridx = (np.searchsorted(choices, a, 'right')-1).clip(min=0)
    cl = np.take(choices,lidx) # Or choices[lidx]
    cr = np.take(choices,ridx) # Or choices[ridx]
    # Keep elements of a whose closer neighbour is within thresh
    return a[np.minimum(np.abs(a - cl), np.abs(a - cr)) < thresh]

Sample run -

In [95]: searchsorted_filter(A,B, thresh = 90)
Out[95]: array([   2,  300,  793, 1300, 1810])
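
To make the mechanics clearer, here is a small sketch (not from the original answer) that spells out the nearest-neighbour distances the function computes for the sample A and B -

import numpy as np

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])

choices = np.sort(B)
lidx = np.searchsorted(choices, A, 'left').clip(max=choices.size - 1)
ridx = (np.searchsorted(choices, A, 'right') - 1).clip(min=0)
# Distance from each element of A to its nearest element of B
dist = np.minimum(np.abs(A - choices[lidx]), np.abs(A - choices[ridx]))
print(dist)          # [  2  96   5   4  66 266  80 510]
print(A[dist < 90])  # [   2  300  793 1300 1810]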

Runtime test

In [104]: A = np.sort(np.random.randint(0,100000,(1000)))

In [105]: B = np.sort(np.random.randint(0,100000,(400)))

In [106]: out1 = A[(np.abs(A[:,None] - B) < 10).any(1)]

In [107]: out2 = searchsorted_filter(A,B, thresh = 10)

In [108]: np.allclose(out1, out2)  # Verify results
Out[108]: True

In [109]: %timeit A[(np.abs(A[:,None] - B) < 10).any(1)]
100 loops, best of 3: 2.74 ms per loop

In [110]: %timeit searchsorted_filter(A,B, thresh = 10)
10000 loops, best of 3: 85.3 µs per loop

Update in January 2018 with a further performance boost

We can avoid the second call to np.searchsorted(..., 'right'), as well as the absolute-value computations, by reusing the indices obtained from np.searchsorted(..., 'left'), like so -

def searchsorted_filter_v2(a, b, thresh):
    N = len(b)

    choices = np.sort(b) # if b is already sorted, skip it

    # Index of the first element of choices that is >= each element of a
    l = np.searchsorted(choices, a, 'left')
    l_invalid_mask = l==N              # elements of a beyond the largest choice
    l[l_invalid_mask] = N-1
    left_offset = choices[l]-a
    left_offset[l_invalid_mask] *= -1  # make the edge-case offsets positive

    # Index of the preceding element of choices (same element on an exact
    # match), derived from l to avoid a second np.searchsorted call
    r = (l - (left_offset!=0))
    r_invalid_mask = r<0               # elements of a below the smallest choice
    r[r_invalid_mask] = 0
    r += l_invalid_mask
    right_offset = a-choices[r]
    right_offset[r_invalid_mask] *= -1 # make the edge-case offsets positive

    # Keep elements of a within thresh of either neighbour
    out = a[(left_offset < thresh) | (right_offset < thresh)]
    return out

Updated timings to test the further speedup -

In [388]: np.random.seed(0)
     ...: A = np.random.randint(0,1000000,(100000))
     ...: B = np.unique(np.random.randint(0,1000000,(40000)))
     ...: np.random.shuffle(B)
     ...: thresh = 10
     ...: 
     ...: out1 = searchsorted_filter(A, B, thresh)
     ...: out2 = searchsorted_filter_v2(A, B, thresh)
     ...: print np.allclose(out1, out2)
True

In [389]: %timeit searchsorted_filter(A, B, thresh)
10 loops, best of 3: 24.2 ms per loop

In [390]: %timeit searchsorted_filter_v2(A, B, thresh)
100 loops, best of 3: 13.9 ms per loop

Digging deeper -

In [396]: a = A; b = B

In [397]: N = len(b)
     ...: 
     ...: choices = np.sort(b) # if b is already sorted, skip it
     ...: 
     ...: l = np.searchsorted(choices, a, 'left')

In [398]: %timeit np.sort(B)
100 loops, best of 3: 2 ms per loop

In [399]: %timeit np.searchsorted(choices, a, 'left')
100 loops, best of 3: 10.3 ms per loop

It seems that searchsorted and sort take up almost all of the runtime, and both appear essential to this method. So, it doesn't look like this sort-based approach can be improved much further.
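
Since the np.sort call accounts for roughly 2 ms of the ~24 ms measured above, a small extra gain is possible when B already arrives sorted, as the in-code comment suggests. A hedged sketch (searchsorted_filter_presorted is a hypothetical name, not from the original answer) -

def searchsorted_filter_presorted(a, choices, thresh):
    # Same as searchsorted_filter, but assumes `choices` is already sorted
    lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size - 1)
    ridx = (np.searchsorted(choices, a, 'right') - 1).clip(min=0)
    return a[np.minimum(np.abs(a - choices[lidx]), np.abs(a - choices[ridx])) < thresh]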
