Finding a set of indices that maps the rows of one NumPy ndarray to another
Problem description
I have two structured 2D numpy arrays which are equal in principle, meaning
A = numpy.array([[a1,b1,c1],
                 [a2,b2,c2],
                 [a3,b3,c3],
                 [a4,b4,c4]])
B = numpy.array([[a2,b2,c2],
                 [a4,b4,c4],
                 [a3,b3,c3],
                 [a1,b1,c1]])
Not in the sense of
numpy.array_equal(A,B) # False
numpy.array_equiv(A,B) # False
numpy.equal(A,B) # ndarray of True and False
But in the sense that one array (A) is the original and in the other one (B) the data is shuffled along one axis (could be along the rows or columns).
What is an efficient way to sort/shuffle B to match or become equal to A, or alternatively sort A to become equal to B? An equality check is indeed not important, as long as both arrays are shuffled to match each other. A and hence B have unique rows.
I tried the view method to sort both the arrays like so
def sort2d(A):
    A_view = np.ascontiguousarray(A).view(np.dtype((np.void,
                 A.dtype.itemsize * A.shape[1])))
    A_view.sort()
    return A_view.view(A.dtype).reshape(-1,A.shape[1])
but that doesn't work here apparently. This operation needs to be performed for really large arrays, so performance and scalability are critical.
Based on your example, it seems that you have shuffled all of the columns simultaneously, such that there is a vector of row indices that maps A→B. Here's a toy example:
import numpy as np
A = np.random.permutation(12).reshape(4, 3)
idx = np.random.permutation(4)
B = A[idx]
print(repr(A))
# array([[ 7, 11,  6],
#        [ 4, 10,  8],
#        [ 9,  2,  0],
#        [ 1,  3,  5]])
print(repr(B))
# array([[ 1,  3,  5],
#        [ 4, 10,  8],
#        [ 7, 11,  6],
#        [ 9,  2,  0]])
We want to recover a set of indices, idx, such that A[idx] == B. This will be a unique mapping if and only if A and B contain no repeated rows.
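To make the uniqueness condition concrete, here is a small sketch with made-up data in which A contains a repeated row. The arrays and the two index vectors are illustrative assumptions, not taken from the question; the point is only that two different index vectors both reproduce B.

```python
import numpy as np

# Hypothetical toy data: rows 0 and 2 of A are identical.
A = np.array([[1, 2],
              [3, 4],
              [1, 2]])
B = A[[1, 0, 2]]

# Two different candidate index vectors.
idx_first = np.array([1, 0, 2])
idx_second = np.array([1, 2, 0])

# Both reproduce B, so the mapping is not unique when rows repeat.
both_valid = (np.array_equal(A[idx_first], B)
              and np.array_equal(A[idx_second], B))
print(both_valid)
```

With unique rows this ambiguity cannot arise, which is the case the question describes.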
One efficient* approach would be to find the indices that would lexically sort the rows in A, then find where each row in B would fall within the sorted version of A. A useful trick is to view A and B as 1D arrays using an np.void dtype that treats each row as a single element:
# bytes per row; use integer division so the size stays an int on Python 3
rowtype = np.dtype((np.void, A.dtype.itemsize * A.size // A.shape[0]))
# A and B must be C-contiguous, might need to force a copy here
a = np.ascontiguousarray(A).view(rowtype).ravel()
b = np.ascontiguousarray(B).view(rowtype).ravel()
a_to_as = np.argsort(a) # indices that sort the rows of A in lexical order
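As a quick illustration of what the np.void view does to the shape, the following sketch uses a small made-up int64 array (the array and the variable names `viewed` and `restored` are assumptions for the example). It shows that each row collapses into one opaque element and that the view can be undone without copying the data again.

```python
import numpy as np

# Hypothetical 4x3 array with a fixed 8-byte dtype so the sizes are predictable.
A = np.arange(12, dtype=np.int64).reshape(4, 3)

row_bytes = A.dtype.itemsize * A.shape[1]      # 8 bytes * 3 columns
rowtype = np.dtype((np.void, row_bytes))

# Viewing a C-contiguous (4, 3) array as one void element per row
# gives shape (4, 1); ravel() flattens that to (4,).
viewed = np.ascontiguousarray(A).view(rowtype)
a = viewed.ravel()
print(viewed.shape, a.shape, a.dtype.itemsize)

# The view is reversible: reinterpret the bytes as the original dtype.
restored = a.view(A.dtype).reshape(A.shape)
print(np.array_equal(restored, A))
```

Note that the rows are compared as raw bytes, so the resulting sort order is consistent between A and B but is not necessarily the numerical order of the values.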
Now we can use np.searchsorted to perform a binary search for where each row in B would fall within the sorted version of A:
# using the `sorter=` argument rather than `a[a_to_as]` avoids making a copy of `a`
as_to_b = a.searchsorted(b, sorter=a_to_as)
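The role of the `sorter=` argument may be easier to see on plain integers. This is a minimal sketch with made-up values (`values`, `queries` and the other names are assumptions for illustration): the positions returned refer to the sorted order, and indexing the sorter with them translates back to positions in the unsorted array.

```python
import numpy as np

values = np.array([30, 10, 20])     # unsorted "haystack"
queries = np.array([20, 30, 10])    # elements to look up

order = np.argsort(values)          # indices that would sort `values`

# Positions of the queries within the sorted order, without materialising it.
pos_with_sorter = values.searchsorted(queries, sorter=order)

# Same positions, but this variant builds a sorted copy first.
pos_with_copy = values[order].searchsorted(queries)

# Translate positions in the sorted order back to indices into `values`.
original_index = order[pos_with_sorter]
print(pos_with_sorter, original_index)
print(np.array_equal(values[original_index], queries))
```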
The mapping from A→B can be expressed as a composite of A→As→B, where As is the sorted version of A:
a_to_b = a_to_as.take(as_to_b)
print(np.all(A[a_to_b] == B))
# True
If A and B contain no repeated rows, the inverse mapping from B→A can also be obtained using
b_to_a = np.argsort(a_to_b)
print(np.all(B[b_to_a] == A))
# True
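Inverting a permutation with np.argsort costs another sort. As a hedged alternative, the sketch below builds the same inverse by scatter assignment in a single pass; the permutation is random made-up data and the names `b_to_a_sorted` and `b_to_a_scatter` are assumptions for the example. It is only valid when a_to_b really is a permutation, i.e. when there are no repeated rows.

```python
import numpy as np

rng = np.random.RandomState(0)
a_to_b = rng.permutation(8)          # stand-in for the mapping computed above

# Inverse via sorting, as in the answer.
b_to_a_sorted = np.argsort(a_to_b)

# Inverse via scatter: position a_to_b[i] of the inverse must hold i.
b_to_a_scatter = np.empty_like(a_to_b)
b_to_a_scatter[a_to_b] = np.arange(a_to_b.size)

print(np.array_equal(b_to_a_sorted, b_to_a_scatter))
```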
As a single function:
def find_row_mapping(A, B):
    """
    Given A and B, where B is a copy of A permuted over the first dimension, find
    a set of indices idx such that A[idx] == B.
    This is a unique mapping if and only if there are no repeated rows in A and B.

    Arguments:
        A, B:   n-dimensional arrays with same shape and dtype
    Returns:
        idx:    vector of indices into the rows of A
    """

    if not (A.shape == B.shape):
        raise ValueError('A and B must have the same shape')
    if not (A.dtype == B.dtype):
        raise TypeError('A and B must have the same dtype')

    # bytes per row; integer division keeps the size an int on Python 3
    rowtype = np.dtype((np.void, A.dtype.itemsize * A.size // A.shape[0]))
    a = np.ascontiguousarray(A).view(rowtype).ravel()
    b = np.ascontiguousarray(B).view(rowtype).ravel()
    a_to_as = np.argsort(a)
    as_to_b = a.searchsorted(b, sorter=a_to_as)

    return a_to_as.take(as_to_b)
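The function above assumes that B really is a row permutation of A and does not check it. A possible variant is sketched below; the name `find_row_mapping_checked`, the clipping step and the final verification are assumptions added for illustration rather than part of the original answer.

```python
import numpy as np

def find_row_mapping_checked(A, B):
    """Hypothetical variant: same steps as above plus a final verification."""
    if A.shape != B.shape:
        raise ValueError('A and B must have the same shape')
    if A.dtype != B.dtype:
        raise TypeError('A and B must have the same dtype')

    # One opaque void element per row (integer byte count).
    row_bytes = A.dtype.itemsize * (A.size // A.shape[0])
    rowtype = np.dtype((np.void, row_bytes))
    a = np.ascontiguousarray(A).view(rowtype).ravel()
    b = np.ascontiguousarray(B).view(rowtype).ravel()

    a_to_as = np.argsort(a)
    as_to_b = a.searchsorted(b, sorter=a_to_as)
    # searchsorted returns len(a) for a row that sorts after every row of A;
    # clip so that take() cannot go out of bounds before we verify.
    as_to_b = np.minimum(as_to_b, len(a) - 1)
    idx = a_to_as.take(as_to_b)

    # Extra O(n * m) pass: reject inputs where B is not a permutation of A.
    if not np.array_equal(A[idx], B):
        raise ValueError('B is not a row permutation of A')
    return idx

# Usage on made-up data.
gen = np.random.RandomState(0)
A = gen.permutation(20).reshape(5, 4)
B = A[gen.permutation(5)]
idx = find_row_mapping_checked(A, B)
print(np.array_equal(A[idx], B))
```

The verification costs one extra pass over the data, which may or may not be acceptable for very large arrays.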
Benchmark:
In [1]: gen = np.random.RandomState(0)
In [2]: %%timeit A = gen.rand(1000000, 100); B = A.copy(); gen.shuffle(B)
....: find_row_mapping(A, B)
1 loop, best of 3: 2.76 s per loop
*The most costly step would be the quicksort over rows which is O(n log n) on average. I'm not sure it's possible to do any better than this.
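One way around the sort is a hash lookup keyed on the raw bytes of each row, which does an expected constant amount of work per row. The sketch below is an assumption-laden alternative, not the approach benchmarked above: the function name `find_row_mapping_hashed` is hypothetical, it loops in Python, and it has not been timed against the sort-based version.

```python
import numpy as np

def find_row_mapping_hashed(A, B):
    """Hypothetical alternative: look rows up by their raw bytes in a dict."""
    if A.dtype != B.dtype:
        raise TypeError('A and B must have the same dtype')
    A = np.ascontiguousarray(A)
    B = np.ascontiguousarray(B)
    # Map the bytes of each row of A to its row index.
    # With repeated rows the last occurrence wins.
    index_of_row = {row.tobytes(): i for i, row in enumerate(A)}
    # A KeyError here means some row of B does not occur in A.
    return np.array([index_of_row[row.tobytes()] for row in B], dtype=np.intp)

# Usage on made-up data.
gen = np.random.RandomState(1)
A = gen.permutation(30).reshape(10, 3)
B = A[gen.permutation(10)]
idx = find_row_mapping_hashed(A, B)
print(np.array_equal(A[idx], B))
```

Because the per-row work happens in the Python interpreter, the constant factor is large, so the better asymptotic behaviour does not by itself guarantee a speedup.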