Find indices of large array if it contains values in smaller array


Question

Is there a fast NumPy function that returns the indices in a larger array where its values match those in a smaller array? The smaller array has ~30M values and the larger has 800M, so I want to avoid a for-loop of numpy.where calls.

The problem with searchsorted is that it returns a result even when there is no exact match: it just gives the closest (insertion) index. I only want the indices where there is an exact match.

Instead of this:

>>> a = array([1,2,3,4,5])
>>> b = array([2,4,7])
>>> searchsorted(a,b)
array([1, 3, 5])

I want this:

>>> a = array([1,2,3,4,5])
>>> b = array([2,4,7])
>>> SOMEFUNCTION(a,b)
array([1, 3])

Edit: the values in both the smaller and larger arrays are always unique and sorted.

Recommended answer

You could use np.in1d to find those elements of a which are in b. To find the indices, use one call to np.where:

In [34]: a = array([1,2,3,4,5])

In [35]: b = array([2,4,7])

In [36]: np.in1d(a, b)
Out[36]: array([False,  True, False,  True, False], dtype=bool)

In [39]: np.where(np.in1d(a, b))
Out[39]: (array([1, 3]),)
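
On recent NumPy releases, np.isin is the documented successor to np.in1d (which is deprecated as of NumPy 2.0), and np.intersect1d accepts return_indices=True (NumPy 1.15 and later) to report the matching positions directly. The following is a minimal sketch using the arrays from the question, not part of the original answer; assume_unique=True relies on the question's statement that the values are unique.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([2, 4, 7])

# Boolean membership mask over a, then the positions where it is True.
idx_isin = np.flatnonzero(np.isin(a, b, assume_unique=True))

# intersect1d can also report where the common values sit in each input.
common, idx_a, idx_b = np.intersect1d(a, b, assume_unique=True,
                                      return_indices=True)

print(idx_isin)  # [1 3]
print(idx_a)     # [1 3]  positions in a
print(idx_b)     # [0 1]  positions in b
```

Neither call exploits the fact that the inputs are already sorted, so on arrays of the size mentioned in the question the searchsorted-based variants may still be preferable.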

Because a and b are already sorted, you could use

In [57]: np.searchsorted(b, a, side='right') != np.searchsorted(b, a, side='left')
Out[57]: array([False,  True, False,  True, False], dtype=bool)

instead of np.in1d(a, b). For large a and b, using searchsorted may be faster:

import numpy as np
a = np.random.choice(10**7, size=10**6, replace=False)
a.sort()
b = np.random.choice(10**7, size=10**5, replace=False)
b.sort()

In [53]: %timeit np.in1d(a, b)
10 loops, best of 3: 176 ms per loop

In [54]: %timeit np.searchsorted(b, a, side='right') != np.searchsorted(b, a, side='left')
10 loops, best of 3: 106 ms per loop
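
To see why comparing the two searchsorted results gives a membership mask, it may help to look at the intermediate arrays for the small example from the question. With side='left' the insertion point lands before any equal element of b, and with side='right' it lands after it, so the two differ only for values of a that actually occur in b. A short illustration:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([2, 4, 7])

# Insertion points of each element of a into the sorted array b.
left = np.searchsorted(b, a, side='left')
right = np.searchsorted(b, a, side='right')

print(left)   # [0 0 1 1 2]
print(right)  # [0 1 1 2 2]

# The two differ exactly where a[i] is present in b.
print(np.flatnonzero(left != right))  # [1 3]
```

Note that this form performs one binary search per element of a, the larger array, and does so twice.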


Jaime (http://stackoverflow.com/questions/31789187/find-indices-of-large-array-if-it-contains-values-in-smaller-array/31789252#comment51510477_31790052) and Divakar have suggested some significant improvements on the method shown above. Here is some code which tests that the methods all return the same result, followed by some benchmarks:

import numpy as np

a = np.random.choice(10**7, size=10**6, replace=False)
a.sort()
b = np.random.choice(10**7, size=10**5, replace=False)
b.sort()

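def using_searchsorted(a, b):
    # a[i] is in b exactly when its left and right insertion points in b differ.
    return (np.where(np.searchsorted(b, a, side='right')
                     != np.searchsorted(b, a, side='left')))[0]

def using_in1d(a, b):
    # Membership mask over a, converted to indices.
    return np.where(np.in1d(a, b))[0]

def using_searchsorted_divakar(a, b):
    # Search the smaller array b in a; keep the left index wherever a match exists.
    idx1 = np.searchsorted(a,b,'left')
    idx2 = np.searchsorted(a,b,'right')
    out = idx1[idx1 != idx2]
    return out

def using_jaime_mask(haystack, needle):
    # One search, then check that the value at each insertion point equals the needle.
    idx = np.searchsorted(haystack, needle)
    mask = idx < haystack.size  # drop needles larger than every haystack value
    mask[mask] = haystack[idx[mask]] == needle[mask]
    idx = idx[mask]
    return idx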

expected = using_searchsorted(a, b)
for func in (using_in1d, using_searchsorted_divakar, using_jaime_mask):
    result = func(a, b)
    assert np.array_equal(expected, result)  # exact comparison of integer indices
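
The mask-based variant is the least obvious of the four, so here is a step-by-step trace on a small input. The arrays are the ones from the question, except that needle has an extra 0 (an assumption added for illustration) so that both rejection cases show up: a needle that is past the end of haystack, and a needle whose insertion point holds a different value.

```python
import numpy as np

haystack = np.array([1, 2, 3, 4, 5])  # plays the role of the larger array a
needle = np.array([0, 2, 4, 7])       # like b, plus a 0 that is absent from a

# Single binary search of the needles in the haystack (default side='left').
idx = np.searchsorted(haystack, needle)
print(idx.tolist())   # [0, 1, 3, 5]

# Index 5 is out of bounds for haystack, so 7 cannot be a match.
mask = idx < haystack.size
print(mask.tolist())  # [True, True, True, False]

# For the in-bounds candidates, keep only those whose value really matches.
mask[mask] = haystack[idx[mask]] == needle[mask]
print(mask.tolist())  # [False, True, True, False]

print(idx[mask].tolist())  # [1, 3]
```

Because the search runs once over the smaller array rather than twice over the larger one, this approach does far fewer binary searches than using_searchsorted, which is consistent with the timings reported in the answer.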


In [29]: %timeit using_jaime_mask(a, b)
100 loops, best of 3: 13 ms per loop

In [28]: %timeit using_searchsorted_divakar(a, b)
10 loops, best of 3: 21.7 ms per loop

In [26]: %timeit using_searchsorted(a, b)
10 loops, best of 3: 109 ms per loop

In [27]: %timeit using_in1d(a, b)
10 loops, best of 3: 173 ms per loop
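
The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The script below is an illustration only: sorted_unique is a hypothetical helper, the array sizes are reduced so it runs quickly, np.isin stands in for np.in1d, and the numbers it prints will differ from those quoted above.

```python
import timeit
import numpy as np

def using_isin(a, b):
    # np.isin stands in for np.in1d here.
    return np.flatnonzero(np.isin(a, b))

def using_searchsorted(a, b):
    return np.flatnonzero(np.searchsorted(b, a, side='right')
                          != np.searchsorted(b, a, side='left'))

def using_jaime_mask(haystack, needle):
    idx = np.searchsorted(haystack, needle)
    mask = idx < haystack.size
    mask[mask] = haystack[idx[mask]] == needle[mask]
    return idx[mask]

def sorted_unique(rng, high, size):
    # Hypothetical helper: `size` distinct integers in [0, high), sorted.
    values = rng.choice(high, size=size, replace=False)
    values.sort()
    return values

rng = np.random.default_rng(0)        # fixed seed for repeatability
a = sorted_unique(rng, 10**6, 10**5)  # smaller than in the answer
b = sorted_unique(rng, 10**6, 10**4)

timings = {}
for func in (using_isin, using_searchsorted, using_jaime_mask):
    # Check agreement before timing.
    assert np.array_equal(func(a, b), using_isin(a, b))
    runs = timeit.repeat(lambda: func(a, b), number=10, repeat=3)
    timings[func.__name__] = min(runs) / 10  # best average seconds per call

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```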
