Get indices of matches from one array in another


Problem description

Given two NumPy arrays:

a = np.array([1, 6, 5, 3, 8, 345, 34, 6, 2, 867])
b = np.array([867, 8, 34, 75])

I would like to get an np.array with the same shape as b, where each value is the index at which the corresponding value in b appears in a, or np.nan if that value is not present in a.

The result should be:

[9, 4, 6, nan]

a and b will always have the same number of dimensions, but the sizes of the dimensions may differ.

Something like this (pseudocode):

c = np.where(b in a)

but which works for arrays ("in" does not).

I would prefer a "one-liner", or at least a solution that works entirely at the array level and does not require a loop.

Thx!

Solution

Approach #1

Here's one using np.searchsorted:

import numpy as np

def find_indices(a, b, invalid_specifier=-1):
    # Search for each value of b in the sorted version of a. The sorter
    # argument (argsort of a) handles the case where a is not sorted.
    sidx = a.argsort()
    idx = np.searchsorted(a, b, sorter=sidx)

    # Values of b larger than a.max() get the out-of-bounds position len(a).
    # They cannot be matches, so point them at 0 to keep the indexing valid.
    idx[idx == len(a)] = 0

    # Trace back to the indices in the original (unsorted) a
    idx0 = sidx[idx]

    # Keep the index where the values really match, else invalid_specifier
    return np.where(a[idx0] == b, idx0, invalid_specifier)

Output for the given sample:

In [41]: find_indices(a, b, invalid_specifier=np.nan)
Out[41]: array([ 9.,  4.,  6., nan])
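
The sample a contains the value 6 twice (at positions 1 and 7). NumPy's default argsort is not a stable sort, so which of the two positions would be reported for a query of 6 is not guaranteed. The following is a minimal sketch, assuming the first occurrence is the one wanted: it passes kind="stable" to argsort, and since np.searchsorted defaults to side="left", the leftmost match in sorted order is then the first occurrence in a. The name find_indices_first and the extra query value 6 are illustrative additions, not part of the original answer.

```python
import numpy as np

def find_indices_first(a, b, invalid_specifier=-1):
    # A stable argsort keeps equal values in their original order,
    # so the leftmost match in sorted order is the first occurrence in a.
    sidx = a.argsort(kind="stable")
    idx = np.searchsorted(a, b, sorter=sidx)  # side="left" by default

    # Out-of-bounds positions cannot match; point them at a valid slot
    idx[idx == len(a)] = 0
    idx0 = sidx[idx]
    return np.where(a[idx0] == b, idx0, invalid_specifier)

a = np.array([1, 6, 5, 3, 8, 345, 34, 6, 2, 867])
b = np.array([867, 8, 34, 75, 6])  # 6 appended to exercise the duplicate
result = find_indices_first(a, b, invalid_specifier=np.nan)
print(result)
```

For this input the entries are 9, 4, 6, nan and 1: the query for 6 resolves to position 1, not 7.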

Approach #2

Another one, based on a lookup array, for non-negative integers:

def find_indices_lookup(a, b, invalid_specifier=-1):
    # Set up a lookup array covering every value that occurs in a or b
    N = max(a.max(), b.max()) + 1
    lookup = np.full(N, invalid_specifier)

    # Store the position of each value of a at lookup[value], then index
    # lookup with b to trace back the positions. Values of b that are not in a
    # keep invalid_specifier, since those slots were never assigned a position.
    lookup[a] = np.arange(len(a))
    indices = lookup[b]
    return indices
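
Because this approach uses the values themselves as indices, it only works for non-negative integers, and the lookup array grows with the largest value rather than with the number of elements. A minimal sketch of a guarded variant is shown here; the wrapper name find_indices_lookup_checked and the specific error messages are assumptions added for illustration.

```python
import numpy as np

def find_indices_lookup_checked(a, b, invalid_specifier=-1):
    a = np.asarray(a)
    b = np.asarray(b)

    # The values are used as indices, so they must be non-negative integers
    if not (np.issubdtype(a.dtype, np.integer) and np.issubdtype(b.dtype, np.integer)):
        raise TypeError("lookup approach needs integer arrays")
    if a.min() < 0 or b.min() < 0:
        raise ValueError("lookup approach needs non-negative values")

    # Memory use is proportional to the largest value, not to len(a)
    N = max(a.max(), b.max()) + 1
    lookup = np.full(N, invalid_specifier)
    lookup[a] = np.arange(len(a))
    return lookup[b]

a = np.array([1, 6, 5, 3, 8, 345, 34, 6, 2, 867])
b = np.array([867, 8, 34, 75])
result = find_indices_lookup_checked(a, b)
print(result)
```

With the default invalid_specifier=-1 the result stays an integer array (9, 4, 6, -1 for the sample); passing np.nan makes the lookup array, and therefore the result, floating point.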

Benchmarking

Efficiency wasn't mentioned as a requirement in the question, but the no-loop requirement suggests it matters. Testing with a setup that tries to represent the given sample, scaled up by 1000x:

In [98]: a = np.random.permutation(np.unique(np.random.randint(0,20000,10000)))

In [99]: b = np.random.permutation(np.unique(np.random.randint(0,20000,4000)))

# Solutions from this post
In [100]: %timeit find_indices(a,b,invalid_specifier=np.nan)
     ...: %timeit find_indices_lookup(a,b,invalid_specifier=np.nan)
1.35 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
220 µs ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# @Quang Hoang-soln2
In [101]: %%timeit
     ...: commons, idx_a, idx_b = np.intersect1d(a,b, return_indices=True)
     ...: orders = np.argsort(idx_b)
     ...: output = np.full(b.shape, np.nan)
     ...: output[orders] = idx_a[orders]
1.63 ms ± 59.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# @Quang Hoang-soln1
In [102]: %%timeit
     ...: s = b == a[:,None]
     ...: np.where(s.any(0), np.argmax(s,0), np.nan)
137 ms ± 9.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
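
The slowest timing belongs to the broadcasting comparison, which builds a full len(a) x len(b) boolean matrix before reducing it. As a rough sketch of why that scales poorly, the function below (the name find_indices_broadcast is hypothetical) wraps the same two lines and also reports the size of the intermediate matrix:

```python
import numpy as np

def find_indices_broadcast(a, b):
    # s[i, j] is True where a[i] == b[j]; shape is (len(a), len(b))
    s = b == a[:, None]

    # argmax along axis 0 gives the first matching row for each column;
    # columns with no match at all are replaced by nan
    result = np.where(s.any(0), np.argmax(s, 0), np.nan)
    return result, s.nbytes

a = np.array([1, 6, 5, 3, 8, 345, 34, 6, 2, 867])
b = np.array([867, 8, 34, 75])
result, nbytes = find_indices_broadcast(a, b)
print(result, nbytes)
```

For the 10 x 4 sample the intermediate takes only 40 bytes, but with several thousand elements in each array, as in the benchmark, it holds tens of millions of comparisons, whereas the other approaches only allocate one-dimensional arrays.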
