向量化为另一个数组中的每个元素在数组中找到最接近的值 [英] Vectorize finding closest value in an array for each element in another array

查看：83 发布时间：2021/4/2 20:23:26 python algorithm numpy vectorization cython

本文介绍了向量化为另一个数组中的每个元素在数组中找到最接近的值的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

known_array :numpy数组；仅由标量值组成； shape:(m，1)

known_array : numpy array; consisting of scalar values only; shape: (m, 1)

test_array :numpy数组；仅由标量值组成； shape:(n，1)

test_array : numpy array; consisting of scalar values only; shape: (n, 1)

indices :numpy数组； shape:(n，1);对于 test_array 中的每个值，找到 known_array

indices : numpy array; shape: (n, 1); For each value in test_array finds the index of the closest value in known_array

residual :numpy数组； shape:(n，1);对于 test_array 中的每个值，查找与 known_array

residual : numpy array; shape: (n, 1); For each value in test_array finds the difference from the closest value in known_array

In [17]: known_array = np.array([random.randint(-30,30) for i in range(5)])

In [18]: known_array
Out[18]: array([-24, -18, -13, -30,  29])

In [19]: test_array = np.array([random.randint(-10,10) for i in range(10)])

In [20]: test_array
Out[20]: array([-6,  4, -6,  4,  8, -4,  8, -6,  2,  8])

示例实现(未完全向量化)

def find_nearest(known_array, value):
    idx = (np.abs(known_array - value)).argmin()
    diff = known_array[idx] - value
    return [idx, -diff]

In [22]: indices = np.zeros(len(test_array))

In [23]: residual = np.zeros(len(test_array))

In [24]: for i in range(len(test_array)):
   ....:     [indices[i], residual[i]] =  find_nearest(known_array, test_array[i])
   ....:     

In [25]: indices
Out[25]: array([ 2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.])

In [26]: residual
Out[26]: array([  7.,  17.,   7.,  17.,  21.,   9.,  21.,   7.,  15.,  21.])

加快此任务的最佳方法是什么?Cython是一个选项，但是，我始终希望能够删除 for 循环，并让代码保留为纯NumPy.

What is the best way to speed up this task? Cython is an option, but, I would always prefer to be able to remove the for loop and let the code remain are pure NumPy.

NB :已咨询以下堆栈溢出问题

NB: Following Stack Overflow questions were consulted

更新

我为比较非矢量化和矢量化解决方案做了一些小型基准测试(可接受的答案).

Updates

I did some small benchmarks for comparing the non-vectorized and vectorized solution (accepted answer).

In [48]: [indices1, residual1] = find_nearest_vectorized(known_array, test_array)

In [53]: [indices2, residual2] = find_nearest_non_vectorized(known_array, test_array)

In [54]: indices1==indices2
Out[54]: array([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True],   dtype=bool)

In [55]: residual1==residual2
Out[55]: array([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)

In [56]: %timeit [indices2, residual2] = find_nearest_non_vectorized(known_array, test_array)
10000 loops, best of 3: 173 µs per loop

In [57]: %timeit [indices1, residual1] = find_nearest_vectorized(known_array, test_array)
100000 loops, best of 3: 16.8 µs per loop

大约提高了 10倍！

known_array 未排序.

我按照下面@cyborg的答案运行了基准测试.

I ran the benchmarks as given in the answer by @cyborg below.

案例1 :如果对 known_array 进行了排序

known_array = np.arange(0,1000)
test_array = np.random.randint(0, 100, 10000)
print('Speedups:')
base_time = time_f('base')
for func_name in ['diffs', 'searchsorted1', 'searchsorted2']:
    print func_name + ' is x%.1f faster than base.' % (base_time / time_f(func_name))
    assert np.allclose(base(known_array, test_array), eval(func_name+'(known_array, test_array)'))

Speedups:
diffs is x0.4 faster than base.
searchsorted1 is x81.3 faster than base.
searchsorted2 is x107.6 faster than base.

首先，对于大型数组， diffs 方法实际上要慢一些，它还会占用大量RAM，并且在实际数据上运行时系统会挂起.

Firstly, for large arrays diffs method is actually slower, it also eats up a lot of RAM and my system hanged when I ran it on actual data.

案例2 :未对 known_array 进行排序时；代表实际情况

Case 2 : When known_array is not sorted; which represents actual scenario

known_array = np.random.randint(0,100,100)
test_array = np.random.randint(0, 100, 100)

Speedups:
diffs is x8.9 faster than base.
AssertionError                            Traceback (most recent call last)
<ipython-input-26-3170078c217a> in <module>()
      5 for func_name in ['diffs', 'searchsorted1', 'searchsorted2']:
      6     print func_name + ' is x%.1f faster than base.' % (base_time /  time_f(func_name))
----> 7     assert np.allclose(base(known_array, test_array),  eval(func_name+'(known_array, test_array)'))

AssertionError: 


searchsorted1 is x14.8 faster than base.

我还必须评论说，该方法还应该具有存储效率.否则我的8 GB RAM不足.在基本情况下，就足够了.

I must also comment that the approach should also be memory efficient. Otherwise my 8 GB of RAM is not sufficient. In the base case, it is easily sufficient.

向量化为另一个数组中的每个元素在数组中找到最接近的值 [英] Vectorize finding closest value in an array for each element in another array

问题描述

示例实现(未完全向量化)

更新

Updates

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

向量化为另一个数组中的每个元素在数组中找到最接近的值 [英] Vectorize finding closest value in an array for each element in another array

问题描述

示例实现(未完全向量化)

更新

Updates

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭