Comparing two matrices row-wise by occurrence in NumPy
Question
Suppose I have two NumPy matrices (or Pandas DataFrames, though I'm guessing this will be faster in NumPy).
>>> arr1
array([[3, 1, 4],
       [4, 3, 5],
       [6, 5, 4],
       [6, 5, 4],
       [3, 1, 4]])
>>> arr2
array([[3, 1, 4],
       [8, 5, 4],
       [3, 1, 4],
       [6, 5, 4],
       [3, 1, 4]])
For every row vector in arr1, I want to count the occurrences of that row vector in arr2 and generate a vector of these counts. So for this example, the result would be
[3, 0, 1, 1, 3]
What is an efficient way to do this?
First approach:
The obvious approach of looping over the row vectors of arr1 and building a boolean mask over arr2 for each one seems very slow.
np.apply_along_axis(lambda x: (x == arr2).all(1).sum(), axis=1, arr=arr1)
And it seems like a bad algorithm, as I have to check the same rows multiple times.
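For reference, the same brute-force idea can be written as one broadcast comparison instead of apply_along_axis. This is only a sketch, not a better algorithm: it still compares every row of arr1 with every row of arr2, and it builds an intermediate boolean array of shape (len(arr1), len(arr2), number of columns). The name count_rows_broadcast is made up for illustration.

```python
import numpy as np

def count_rows_broadcast(arr1, arr2):
    # Hypothetical helper name, not from the original post.
    # eq[i, j, k] is True where arr1[i, k] == arr2[j, k].
    eq = arr1[:, None, :] == arr2[None, :, :]
    # A row matches when all of its columns match; sum the matches per row of arr1.
    return eq.all(axis=2).sum(axis=1)

arr1 = np.array([[3, 1, 4], [4, 3, 5], [6, 5, 4], [6, 5, 4], [3, 1, 4]])
arr2 = np.array([[3, 1, 4], [8, 5, 4], [3, 1, 4], [6, 5, 4], [3, 1, 4]])
counts = count_rows_broadcast(arr1, arr2)
print(counts)  # [3 0 1 1 3]
```

The memory cost grows with the product of the two row counts, so this form is mainly useful for small inputs.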
Second approach: I could store the row counts in a collections.Counter and then look them up with apply_along_axis.
from collections import Counter

cnter = Counter(tuple(row) for row in arr2)
np.apply_along_axis(lambda x: cnter[tuple(x)], axis=1, arr=arr1)
This seems to be somewhat faster, but I feel like there still has to be a more direct approach than this.
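A self-contained sketch of the Counter idea, for comparison. Each row of arr2 is hashed once and each row of arr1 becomes a single dictionary lookup, so no row of arr2 is scanned repeatedly. The function name count_rows_counter is hypothetical, and the lookup uses a plain generator rather than apply_along_axis; that substitution is an assumption of this sketch, not something from the post.

```python
from collections import Counter

import numpy as np

def count_rows_counter(arr1, arr2):
    # Hypothetical helper name, not from the original post.
    # tolist() yields plain Python ints, so the tuples hash consistently.
    counts = Counter(map(tuple, arr2.tolist()))
    # A Counter returns 0 for rows that never occur in arr2.
    lookups = (counts[tuple(row)] for row in arr1.tolist())
    return np.fromiter(lookups, dtype=np.intp, count=len(arr1))

arr1 = np.array([[3, 1, 4], [4, 3, 5], [6, 5, 4], [6, 5, 4], [3, 1, 4]])
arr2 = np.array([[3, 1, 4], [8, 5, 4], [3, 1, 4], [6, 5, 4], [3, 1, 4]])
result = count_rows_counter(arr1, arr2)
print(result)  # [3 0 1 1 3]
```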
Recommended answer
Here's a NumPy approach that converts the inputs to 1D equivalents, then sorts them and uses np.searchsorted along with np.bincount for the counting:
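import numpy as np

def searchsorted_based(a, b):
    # Encode each row as one integer; requires non-negative integer entries
    dims = np.maximum(a.max(0), b.max(0)) + 1
    a1D = np.ravel_multi_index(a.T, dims)
    b1D = np.ravel_multi_index(b.T, dims)

    # Sorted unique codes of a, plus each row's position among them
    unq_a1D, IDs = np.unique(a1D, return_inverse=True)

    # Locate each code of b among a's unique codes; clamp out-of-range hits
    fidx = np.searchsorted(unq_a1D, b1D)
    fidx[fidx == unq_a1D.size] = 0
    mask = unq_a1D[fidx] == b1D  # True where the row of b occurs in a

    # minlength keeps count indexable when the largest codes of a
    # have no match in b
    count = np.bincount(fidx[mask], minlength=unq_a1D.size)
    return count[IDs]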
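To make the "1D equivalent" step concrete, here is a small sketch of what np.ravel_multi_index does to the rows. It treats each row as a multi-dimensional index into a grid of shape dims and returns the flat position, so equal rows get equal integers and different rows get different ones. The dims value below is written out by hand as one more than the column-wise maxima of the sample a and b; the variable names are illustrative. The entries must be non-negative integers, and the product of dims has to fit in the index integer type.

```python
import numpy as np

a = np.array([[3, 1, 4], [4, 3, 5], [6, 5, 4]])
dims = (9, 6, 6)  # column maxima of the sample data (8, 5, 5) plus one

# One integer code per row; each column of a is passed as one index array.
codes = np.ravel_multi_index(tuple(a.T), dims)

# The same encoding written out by hand (row-major order).
manual = (a[:, 0] * dims[1] + a[:, 1]) * dims[2] + a[:, 2]

print(codes)   # [118 167 250]
print(manual)  # [118 167 250]
```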
Sample run:
In [308]: a
Out[308]:
array([[3, 1, 4],
       [4, 3, 5],
       [6, 5, 4],
       [6, 5, 4],
       [3, 1, 4]])
In [309]: b
Out[309]:
array([[3, 1, 4],
       [8, 5, 4],
       [3, 1, 4],
       [6, 5, 4],
       [3, 1, 4],
       [2, 1, 5]])
In [310]: searchsorted_based(a,b)
Out[310]: array([3, 0, 1, 1, 3])
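The integer encoding used by searchsorted_based relies on non-negative integer entries. If the data may contain negative values, one possible alternative is to let np.unique with axis=0 label the rows directly. Note that this is a different technique from the answer's, sketched here under that assumption, and it has not been timed against the functions in this thread. The name count_rows_unique is hypothetical.

```python
import numpy as np

def count_rows_unique(a, b):
    # Hypothetical helper name, not from the original answer.
    # Label every distinct row of a and b with an integer ID.
    stacked = np.vstack((a, b))
    _, inv = np.unique(stacked, axis=0, return_inverse=True)
    inv = inv.ravel()  # guard against NumPy versions that return a 2D inverse
    n_unique = inv.max() + 1

    # Count how often each row ID appears in b, then look up the IDs of a's rows.
    counts_in_b = np.bincount(inv[len(a):], minlength=n_unique)
    return counts_in_b[inv[:len(a)]]

a = np.array([[3, 1, 4], [4, 3, 5], [6, 5, 4], [6, 5, 4], [3, 1, 4]])
b = np.array([[3, 1, 4], [8, 5, 4], [3, 1, 4], [6, 5, 4], [3, 1, 4], [2, 1, 5]])
out = count_rows_unique(a, b)
print(out)  # [3 0 1 1 3]
```

Sorting the stacked rows costs more than encoding them as integers, so this is a fallback for inputs the encoding cannot handle rather than a replacement.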
Runtime test:
In [377]: A = a[np.random.randint(0,a.shape[0],(1000))]
In [378]: B = b[np.random.randint(0,b.shape[0],(1000))]
In [379]: np.allclose(comp2D_vect(A,B), searchsorted_based(A,B))
Out[379]: True
# comp2D_vect is @Nickil Maveli's solution (from another answer, not shown here)
In [380]: %timeit comp2D_vect(A,B)
10000 loops, best of 3: 184 µs per loop
In [381]: %timeit searchsorted_based(A,B)
10000 loops, best of 3: 92.6 µs per loop