Comparing two matrices row-wise by occurrence in NumPy

Problem description

Suppose I have two NumPy matrices (or Pandas DataFrames, though I'm guessing this will be faster in NumPy).

>>> arr1
array([[3, 1, 4],
       [4, 3, 5],
       [6, 5, 4],
       [6, 5, 4],
       [3, 1, 4]])
>>> arr2
array([[3, 1, 4],
       [8, 5, 4],
       [3, 1, 4],
       [6, 5, 4],
       [3, 1, 4]])

For every row-vector in arr1, I want to count the occurrence of that row vector in arr2 and generate a vector of these counts. So for this example, the result would be

[3, 0, 1, 1, 3]

What is an efficient way to do this?

First approach: The obvious approach of looping over the row vectors of arr1 and generating a corresponding boolean vector on arr2 seems very slow.

np.apply_along_axis(lambda x: (x == arr2).all(1).sum(), axis=1, arr=arr1)

And it seems like a bad algorithm, as I have to check the same rows multiple times.
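
For reference, the same brute-force comparison can be written with broadcasting instead of apply_along_axis. This is only a sketch of the first approach, not a faster algorithm: it still compares every row of arr1 with every row of arr2, and the intermediate boolean array has len(arr1) x len(arr2) x ncols elements, so it may be memory-hungry for large inputs.

```python
import numpy as np

arr1 = np.array([[3, 1, 4], [4, 3, 5], [6, 5, 4], [6, 5, 4], [3, 1, 4]])
arr2 = np.array([[3, 1, 4], [8, 5, 4], [3, 1, 4], [6, 5, 4], [3, 1, 4]])

# Shape (len(arr1), len(arr2), ncols): every row of arr1 against every row of arr2.
pairwise_equal = arr1[:, None, :] == arr2[None, :, :]

# Two rows match only if all columns match; summing over arr2 gives the counts.
counts = pairwise_equal.all(axis=2).sum(axis=1)
print(counts)  # [3 0 1 1 3]
```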

Second approach: I could store the row counts in a collections.Counter, and then just access that with apply_along_axis.

from collections import Counter

cnter = Counter(tuple(row) for row in arr2)
np.apply_along_axis(lambda x: cnter[tuple(x)], axis=1, arr=arr1)

This seems to be somewhat faster, but I feel like there has to still be a more direct approach than this.
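
As a self-contained sketch of this second approach, the lookup can also be done with a plain list comprehension rather than apply_along_axis. The name row_counts is illustrative only, and whether this variant is faster on a given input is untested here.

```python
from collections import Counter

import numpy as np

arr1 = np.array([[3, 1, 4], [4, 3, 5], [6, 5, 4], [6, 5, 4], [3, 1, 4]])
arr2 = np.array([[3, 1, 4], [8, 5, 4], [3, 1, 4], [6, 5, 4], [3, 1, 4]])

# Count each distinct row of arr2 once; tuples of Python ints are hashable.
row_counts = Counter(map(tuple, arr2.tolist()))

# Look up every row of arr1; a Counter returns 0 for rows it has never seen.
counts = np.array([row_counts[tuple(row)] for row in arr1.tolist()])
print(counts)  # [3 0 1 1 3]
```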

Recommended answer

Here's a NumPy approach that converts the inputs to 1D equivalents, then sorts them and uses np.searchsorted along with np.bincount for the counting -

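import numpy as np

def searchsorted_based(a, b):
    # Per-column extents covering every value in a and b (non-negative integers assumed)
    dims = np.maximum(a.max(0), b.max(0)) + 1

    # Encode each row as a single integer so that equal rows get equal codes
    a1D = np.ravel_multi_index(a.T, dims)
    b1D = np.ravel_multi_index(b.T, dims)

    # Sorted unique codes of a, plus the position of each row of a among them
    unq_a1D, IDs = np.unique(a1D, return_inverse=True)
    # Locate each code of b among the sorted unique codes of a
    fidx = np.searchsorted(unq_a1D, b1D)
    fidx[fidx == unq_a1D.size] = 0  # clamp out-of-range positions before indexing
    mask = unq_a1D[fidx] == b1D     # True where the row of b really occurs in a

    # minlength keeps count indexable by IDs even if the largest codes of a never occur in b
    count = np.bincount(fidx[mask], minlength=unq_a1D.size)
    out = count[IDs]
    return out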

Sample run -

In [308]: a
Out[308]: 
array([[3, 1, 4],
       [4, 3, 5],
       [6, 5, 4],
       [6, 5, 4],
       [3, 1, 4]])

In [309]: b
Out[309]: 
array([[3, 1, 4],
       [8, 5, 4],
       [3, 1, 4],
       [6, 5, 4],
       [3, 1, 4],
       [2, 1, 5]])

In [310]: searchsorted_based(a,b)
Out[310]: array([3, 0, 1, 1, 3])
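
To make the "1D equivalents" step concrete, here is a small sketch of what np.ravel_multi_index produces for the sample inputs above. It assumes, as the answer's function does, that all entries are non-negative integers; the product of dims also has to be small enough for NumPy to treat it as a valid array size.

```python
import numpy as np

a = np.array([[3, 1, 4], [4, 3, 5], [6, 5, 4], [6, 5, 4], [3, 1, 4]])
b = np.array([[3, 1, 4], [8, 5, 4], [3, 1, 4], [6, 5, 4], [3, 1, 4], [2, 1, 5]])

# Per-column extents: one more than the largest value seen in either array.
dims = np.maximum(a.max(0), b.max(0)) + 1
print(dims)  # [9 6 6]

# Each row (x, y, z) is encoded as x*36 + y*6 + z, so equal rows get equal codes.
a1D = np.ravel_multi_index(a.T, dims)
b1D = np.ravel_multi_index(b.T, dims)
print(a1D)  # [118 167 250 250 118]
print(b1D)  # [118 322 118 250 118  83]
```

Counting row occurrences then reduces to counting how often each value of a1D appears in b1D, which is what the searchsorted and bincount steps do.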

Runtime test -

In [377]: A = a[np.random.randint(0,a.shape[0],(1000))]

In [378]: B = b[np.random.randint(0,b.shape[0],(1000))]

In [379]: np.allclose(comp2D_vect(A,B), searchsorted_based(A,B))
Out[379]: True

# @Nickil Maveli's soln
In [380]: %timeit comp2D_vect(A,B)
10000 loops, best of 3: 184 µs per loop

In [381]: %timeit searchsorted_based(A,B)
10000 loops, best of 3: 92.6 µs per loop
