在Python中的两个列表/数组中查找最近的项目 [英] finding nearest items across two lists/arrays in Python

查看:101
本文介绍了在Python中的两个列表/数组中查找最近的项目的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个包含浮点值的numpy数组xy.对于x中的每个值,我想找到y中最接近的元素,而无需重用y中的元素.输出应该是元素x的索引到y元素的索引的1-1映射.这是一种依靠排序的糟糕方法.它从列表中删除配对的每个元素.如果不进行排序,那将是不好的,因为配对将取决于原始输入数组的顺序.

I have two numpy arrays x and y containing float values. For each value in x, I want to find the closest element in y, without reusing elements from y. The output should be a 1-1 mapping of indices of elements from x to indices of elements from y. Here's a bad way to do it that relies on sorting. It removes each element that was paired from the list. Without sorting this would be bad because the pairing would depend on the order of the original input arrays.

def min_i(values):
    min_index, min_value = min(enumerate(values),
                               key=operator.itemgetter(1))
    return min_index, min_value

# unsorted elements
unsorted_x = randn(10)*10
unsorted_y = randn(10)*10

# sort lists
x = sort(unsorted_x)
y = sort(unsorted_y)

pairs = []
indx_to_search = range(len(y))

for x_indx, x_item in enumerate(x):
    if len(indx_to_search) == 0:
        print "ran out of items to match..."
        break
    # until match is found look for closest item
    possible_values = y[indx_to_search]
    nearest_indx, nearest_item = min_i(possible_values)
    orig_indx = indx_to_search[nearest_indx]
    # remove it
    indx_to_search.remove(orig_indx)
    pairs.append((x_indx, orig_indx))
print "paired items: "
for k,v in pairs:
    print x[k], " paired with ", y[v]

我更喜欢这样做,而不先对元素进行排序,但是如果对元素进行了排序,那么我想获取原始的,未排序的列表unsorted_xunsorted_y中的索引.在numpy/scipy/Python或使用pandas的最佳方式是什么?谢谢.

I prefer to do it without sorting the elements first, but if they are sorted then I want to get the indices in the original, unsorted lists unsorted_x, unsorted_y. what's the best way to do this in numpy/scipy/Python or using pandas? thanks.

编辑:澄清一下,我并不是想在所有元素之间找到最佳拟合(例如,不使距离之和最小),而是每个元素的最佳拟合,如果可以的话,也可以有时以其他要素为代价.我假设y通常比x大得多,与上面的示例相反,因此对于y中的每个x值通常都有很多很好的拟合,我只想高效地找到一个.

edit: to clarify I'm not trying to find the best fit across all elemets (not minimizing sum of distances for example) but rather the best fit for each element, and it's okay if it's sometimes at the expense of other elements. I assume that y is generally much larger than x contrary to above example and so there's usually many very good fits for each value of x in y, and I just want to find that one efficiently.

有人可以为此显示一个scipy kdtrees示例吗?文档相当稀疏

can someone show an example of scipy kdtrees for this? The docs are quite sparse

kdtree = scipy.spatial.cKDTree([x,y])
kdtree.query([-3]*10) # ?? unsure about what query takes as arg

推荐答案

编辑2 如果您选择了多个可以保证您拥有邻居的邻居,那么使用KDTree的解决方案将表现出色.数组中每个项目的唯一邻居.使用以下代码:

EDIT 2 A solution using KDTree can perform very well if you can choose a number of neighbors that guarantees that you will have a unique neighbor for every item in your array. With the following code:

def nearest_neighbors_kd_tree(x, y, k) :
    x, y = map(np.asarray, (x, y))
    tree =scipy.spatial.cKDTree(y[:, None])    
    ordered_neighbors = tree.query(x[:, None], k)[1]
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    nearest_neighbor.fill(-1)
    used_y = set()
    for j, neigh_j in enumerate(ordered_neighbors) :
        for k in neigh_j :
            if k not in used_y :
                nearest_neighbor[j] = k
                used_y.add(k)
                break
    return nearest_neighbor

n=1000点的样本,我得到:

In [9]: np.any(nearest_neighbors_kd_tree(x, y, 12) == -1)
Out[9]: True

In [10]: np.any(nearest_neighbors_kd_tree(x, y, 13) == -1)
Out[10]: False

因此最佳值为k=13,然后计时为:

So the optimum is k=13, and then the timing is:

In [11]: %timeit nearest_neighbors_kd_tree(x, y, 13)
100 loops, best of 3: 9.26 ms per loop

但是在最坏的情况下,您可能需要k=1000,然后:

But in the worst case, you could need k=1000, and then:

In [12]: %timeit nearest_neighbors_kd_tree(x, y, 1000)
1 loops, best of 3: 424 ms per loop

哪个比其他选项慢?

In [13]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 60 ms per loop

In [14]: %timeit nearest_neighbors_sorted(x, y)
10 loops, best of 3: 47.4 ms per loop


编辑:在搜索之前对数组进行排序会得到1000个以上的数组的结果:


EDIT Sorting the array before searching pays off for arrays of more than 1000 items:

def nearest_neighbors_sorted(x, y) :
    x, y = map(np.asarray, (x, y))
    y_idx = np.argsort(y)
    y = y[y_idx]
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    for j, xj in enumerate(x) :
        idx = np.searchsorted(y, xj)
        if idx == len(y) or idx != 0 and y[idx] - xj > xj - y[idx-1] :
            idx -= 1
        nearest_neighbor[j] = y_idx[idx]
        y = np.delete(y, idx)
        y_idx = np.delete(y_idx, idx)
    return nearest_neighbor

具有10000个元素的长数组:

With a 10000 element long array:

In [2]: %timeit nearest_neighbors_sorted(x, y)
1 loops, best of 3: 557 ms per loop

In [3]: %timeit nearest_neighbors(x, y)
1 loops, best of 3: 1.53 s per loop

对于较小的数组,其性能稍差.

For smaller arrays it performs slightly worse.

您将必须遍历所有项以实现您的贪婪最近邻居算法,如果只是为了丢弃重复项.考虑到这一点,这是我能想到的最快的方法:

You are going to have to loop over all your items to implement your greedy nearest neighbor algorithm, if only to discard duplicates. With that in mind, this is the fastest I have been able to come up with:

def nearest_neighbors(x, y) :
    x, y = map(np.asarray, (x, y))
    y = y.copy()
    y_idx = np.arange(len(y))
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    for j, xj in enumerate(x) :
        idx = np.argmin(np.abs(y - xj))
        nearest_neighbor[j] = y_idx[idx]
        y = np.delete(y, idx)
        y_idx = np.delete(y_idx, idx)

    return nearest_neighbor

现在使用:

n = 1000
x = np.random.rand(n)
y = np.random.rand(2*n)

我得到:

In [11]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 52.4 ms per loop

这篇关于在Python中的两个列表/数组中查找最近的项目的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆