Finding a set of indices that maps the rows of one NumPy ndarray to another

Problem description

I have two structured 2D numpy arrays which are equal in principle, meaning

A = numpy.array([[a1,b1,c1],
                 [a2,b2,c2],
                 [a3,b3,c3],
                 [a4,b4,c4]]) 

B = numpy.array([[a2,b2,c2],
                 [a4,b4,c4],
                 [a3,b3,c3],
                 [a1,b1,c1]])

Not in the sense of

numpy.array_equal(A,B) # False
numpy.array_equiv(A,B) # False
numpy.equal(A,B) # ndarray of True and False

But in the sense that one array (A) is the original and in the other one (B) the data is shuffled along one axis (could be along the rows or columns).

What is an efficient way to sort/shuffle B to match or become equal to A or alternatively sort A to become equal to B? An equality check is indeed not important, as long as both arrays are shuffled to match each other. A and hence B have unique rows.

I tried the view method to sort both the arrays like so

def sort2d(A):
    A_view = np.ascontiguousarray(A).view(np.dtype((np.void,
             A.dtype.itemsize * A.shape[1])))
    A_view.sort()
    return A_view.view(A.dtype).reshape(-1,A.shape[1])   

but that apparently doesn't work here. This operation needs to be performed on really large arrays, so performance and scalability are critical.

Solution

Based on your example, it seems that you have shuffled all of the columns simultaneously, such that there is a vector of row indices that maps A→B. Here's a toy example:

import numpy as np

A = np.random.permutation(12).reshape(4, 3)
idx = np.random.permutation(4)
B = A[idx]

print(repr(A))
# array([[ 7, 11,  6],
#        [ 4, 10,  8],
#        [ 9,  2,  0],
#        [ 1,  3,  5]])

print(repr(B))
# array([[ 1,  3,  5],
#        [ 4, 10,  8],
#        [ 7, 11,  6],
#        [ 9,  2,  0]])

We want to recover a set of indices, idx, such that A[idx] == B. This will be a unique mapping if and only if A and B contain no repeated rows.
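
To make the target concrete, here is a brute-force sketch for small arrays. It compares every row of B with every row of A, so time and memory grow quadratically with the number of rows; it is only meant as a reference against which a faster method can be checked. The helper name `find_row_mapping_bruteforce` is made up for this illustration.

```python
import numpy as np

def find_row_mapping_bruteforce(A, B):
    """Return idx with A[idx] == B by comparing all pairs of rows (small inputs only)."""
    # match[j, i] is True when row j of B equals row i of A
    match = (B[:, None, :] == A[None, :, :]).all(axis=2)
    if not match.any(axis=1).all():
        raise ValueError('some row of B does not occur in A')
    # argmax returns the position of the first True in each row of `match`
    return match.argmax(axis=1)

rng = np.random.default_rng(0)
A = rng.permutation(12).reshape(4, 3)   # all values distinct, so rows are unique
perm = rng.permutation(4)
B = A[perm]

idx = find_row_mapping_bruteforce(A, B)
print(np.array_equal(A[idx], B))   # True
print(np.array_equal(idx, perm))   # True, because the rows of A are unique
```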


One efficient* approach would be to find the indices that would lexically sort the rows in A, then find where each row in B would fall within the sorted version of A. A useful trick is to view A and B as 1D arrays using an np.void dtype that treats each row as a single element:

# number of bytes per row; use integer division, since `/` yields a float on Python 3
rowtype = np.dtype((np.void, A.dtype.itemsize * A.size // A.shape[0]))
# A and B must be C-contiguous, might need to force a copy here
a = np.ascontiguousarray(A).view(rowtype).ravel()
b = np.ascontiguousarray(B).view(rowtype).ravel()

a_to_as = np.argsort(a)     # indices that sort the rows of A in lexical order
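
As a small illustration of what the view does, the sketch below checks the shape and item size of the result on a fixed 4x3 `int64` array (the array and its dtype are chosen just for this example). Each element of the 1D view holds the raw bytes of one row. Comparisons on such elements work on those bytes, so the resulting order need not coincide with numeric order; that should be sufficient here, because A and B only have to be ordered consistently with each other.

```python
import numpy as np

A = np.arange(12, dtype=np.int64).reshape(4, 3)

# one opaque element per row: 3 values * 8 bytes = 24 bytes
rowtype = np.dtype((np.void, A.dtype.itemsize * A.shape[1]))
a = np.ascontiguousarray(A).view(rowtype).ravel()

print(A.shape, a.shape)                    # (4, 3) (4,)
print(a.dtype.itemsize)                    # 24
# each element holds exactly the bytes of the corresponding row
print(a[2].tobytes() == A[2].tobytes())    # True
```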

Now we can use np.searchsorted to perform a binary search for where each row in B would fall within the sorted version of A:

# using the `sorter=` argument rather than `a[a_to_as]` avoids making a copy of `a`
as_to_b = a.searchsorted(b, sorter=a_to_as)

The mapping from A→B can be expressed as a composite of A→As→B:

a_to_b = a_to_as.take(as_to_b)
print(np.all(A[a_to_b] == B))
# True
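
The composition may be easier to follow on a 1D array, where each "row" is a single number. The values below are picked by hand for illustration; the calls are the same `np.argsort` and `np.searchsorted` steps as above.

```python
import numpy as np

a = np.array([30, 10, 20])
b = np.array([20, 30, 10])          # a shuffled copy of a

a_to_as = np.argsort(a)             # [1 2 0]: a[a_to_as] is [10 20 30]
# position of each element of b within the sorted version of a
as_to_b = np.searchsorted(a, b, sorter=a_to_as)   # [1 2 0]
a_to_b = a_to_as[as_to_b]           # [2 0 1]

print(a[a_to_b])                    # [20 30 10]
```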

If A and B contain no repeated rows, the inverse mapping from B→A can also be obtained using

b_to_a = np.argsort(a_to_b)
print(np.all(B[b_to_a] == A))
# True
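
This works because `a_to_b` is a permutation of `0..n-1` when there are no repeated rows, and the argsort of a permutation is its inverse. A minimal sketch, using a random permutation `p` as a stand-in for `a_to_b`, also shows a scatter-based way to build the same inverse without sorting:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.permutation(6)              # stands in for a_to_b

inv_sort = np.argsort(p)            # inverse via sorting, O(n log n)

# scatter-based inverse, O(n): position p[i] of the result receives i
inv_scatter = np.empty_like(p)
inv_scatter[p] = np.arange(p.size)

print(np.array_equal(inv_sort, inv_scatter))        # True
print(np.array_equal(p[inv_sort], np.arange(6)))    # True
```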

As a single function:

def find_row_mapping(A, B):
    """
    Given A and B, where B is a copy of A permuted over the first dimension, find
    a set of indices idx such that A[idx] == B.
    This is a unique mapping if and only if there are no repeated rows in A and B.

    Arguments:
        A, B:   n-dimensional arrays with same shape and dtype
    Returns:
        idx:    vector of indices into the rows of A
    """

    if not (A.shape == B.shape):
        raise ValueError('A and B must have the same shape')
    if not (A.dtype == B.dtype):
        raise TypeError('A and B must have the same dtype')

    # bytes per row; integer division keeps the size an int on Python 3
    rowtype = np.dtype((np.void, A.dtype.itemsize * A.size // A.shape[0]))
    a = np.ascontiguousarray(A).view(rowtype).ravel()
    b = np.ascontiguousarray(B).view(rowtype).ravel()
    a_to_as = np.argsort(a)
    as_to_b = a.searchsorted(b, sorter=a_to_as)

    return a_to_as.take(as_to_b)
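
For comparison, here is a sketch of a different route that avoids the `np.void` view: sort the rows of both arrays with `np.lexsort` and match them position by position. This is not the method above, just an alternative under the assumption that A and B are 2D and that B really is a row permutation of A. The name `find_row_mapping_lexsort` is invented for this example. It sorts both arrays and builds sorted copies for the consistency check, so it will likely use more memory.

```python
import numpy as np

def find_row_mapping_lexsort(A, B):
    """Return idx such that A[idx] == B, for 2D arrays holding the same rows."""
    if A.shape != B.shape:
        raise ValueError('A and B must have the same shape')
    # np.lexsort treats the LAST key as primary, so pass the columns reversed
    sa = np.lexsort(A.T[::-1])      # A[sa] has its rows in sorted order
    sb = np.lexsort(B.T[::-1])      # B[sb] has its rows in sorted order
    if not np.array_equal(A[sa], B[sb]):
        raise ValueError('B is not a row permutation of A')
    # A[sa[k]] == B[sb[k]] for every k, so scatter sa into the positions sb
    idx = np.empty_like(sa)
    idx[sb] = sa
    return idx

rng = np.random.default_rng(0)
A = rng.random((1000, 5))
B = A[rng.permutation(1000)]

idx = find_row_mapping_lexsort(A, B)
print(np.array_equal(A[idx], B))    # True
```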

Benchmark:

In [1]: gen = np.random.RandomState(0)
In [2]: %%timeit A = gen.rand(1000000, 100); B = A.copy(); gen.shuffle(B)
....: find_row_mapping(A, B)
1 loop, best of 3: 2.76 s per loop


*The most costly step would be the quicksort over rows, which is O(n log n) on average. I'm not sure it's possible to do any better than this.
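
One way to avoid the sort altogether is a hash lookup, which needs an expected O(n) number of dictionary operations. The sketch below keys a plain Python `dict` on the raw bytes of each row; `find_row_mapping_hash` is a hypothetical helper, not part of NumPy. Whether it is actually faster than the vectorised sort is not something this sketch establishes, since the Python-level loops may well dominate for large arrays. Note also that byte equality is stricter than `==` for floats (for example, `0.0` and `-0.0` have different bytes).

```python
import numpy as np

def find_row_mapping_hash(A, B):
    """Return idx with A[idx] == B using a dict keyed on each row's raw bytes."""
    A = np.ascontiguousarray(A)
    B = np.ascontiguousarray(B)
    if A.dtype != B.dtype:
        raise TypeError('A and B must have the same dtype')
    # later duplicates overwrite earlier ones; with unique rows this is moot
    lookup = {row.tobytes(): i for i, row in enumerate(A)}
    try:
        return np.array([lookup[row.tobytes()] for row in B], dtype=np.intp)
    except KeyError:
        raise ValueError('some row of B does not occur in A') from None

rng = np.random.default_rng(0)
A = rng.random((500, 4))
B = A[rng.permutation(500)]

idx = find_row_mapping_hash(A, B)
print(np.array_equal(A[idx], B))    # True
```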
