Sort rows of an array to match the order of another array using an identifier column


Question

I have two arrays like this:

A = [[111, ...],          B = [[222, ...],
     [222, ...],               [111, ...],
     [333, ...],               [333, ...],
     [555, ...]]               [444, ...],
                               [555, ...]]

The first column contains identifiers and the remaining columns contain data; B has many more columns than A. The identifiers are unique. A can have fewer rows than B, so in some cases empty spacer rows are needed.
I am looking for an efficient way to match the rows of matrix A to matrix B, so that the result looks like this:

A = [[222, ...],
     [111, ...],
     [333, ...],
     [nan, nan], #could be any unused value
     [555, ...]]

I could just sort both matrices or write a for loop, but both approaches seem clumsy... Is there a better implementation?

Recommended answer

Here's a vectorized approach using np.searchsorted:

import numpy as np

# Store the indices that sort A by its identifier column (col-0)
sidx = A[:,0].argsort()

# Find the indices of col-0 of B in col-0 of sorted A
l_idx = np.searchsorted(A[:,0],B[:,0],sorter = sidx)

# Create a mask corresponding to all those indices that indicates which indices
# corresponding to B's col-0 match up with A's col-0
valid_mask = l_idx != np.searchsorted(A[:,0],B[:,0],sorter = sidx,side='right')

# Initialize output array with NaNs. 
# Use l_idx to set rows from A into output array. Use valid_mask to select 
# indices from l_idx and output rows that are to be set.
out = np.full((B.shape[0],A.shape[1]),np.nan)
out[valid_mask] = A[sidx[l_idx[valid_mask]]]

Note that valid_mask could also be created with np.in1d, as np.in1d(B[:,0],A[:,0]) (or np.isin in newer NumPy versions), which is more intuitive. We use np.searchsorted here because it performs better, as discussed in greater detail in this other solution.
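
To illustrate the two ways of building valid_mask, here is a minimal sketch that computes both and checks that they agree. The small input arrays are made up for the illustration, and np.isin is used as the membership test in place of np.in1d; this compares the resulting masks only and says nothing about their relative speed.

```python
import numpy as np

# Illustrative data (assumed for this sketch): identifiers are in column 0
A = np.array([[45, 11, 86],
              [18, 74, 59],
              [30, 68, 13],
              [55, 47, 78]])
B_ids = np.array([45, 55, 95, 30, 14, 18])

# Indices that sort A's identifier column
sidx = A[:, 0].argsort()

# Left and right insertion points of each B identifier in the sorted A identifiers
l_idx = np.searchsorted(A[:, 0], B_ids, sorter=sidx)
r_idx = np.searchsorted(A[:, 0], B_ids, sorter=sidx, side='right')

# An identifier is present in A exactly when the two insertion points differ
mask_searchsorted = l_idx != r_idx

# Membership test; same information, computed directly
mask_isin = np.isin(B_ids, A[:, 0])

print(np.array_equal(mask_searchsorted, mask_isin))
```

Either mask can be used to select the output rows; the searchsorted call for l_idx is still needed to know which row of A goes where.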

Sample run:

In [184]: A
Out[184]: 
array([[45, 11, 86],
       [18, 74, 59],
       [30, 68, 13],
       [55, 47, 78]])

In [185]: B
Out[185]: 
array([[45, 11, 88],
       [55, 83, 46],
       [95, 87, 77],
       [30,  9, 37],
       [14, 97, 98],
       [18, 48, 53]])

In [186]: out
Out[186]: 
array([[ 45.,  11.,  86.],
       [ 55.,  47.,  78.],
       [ nan,  nan,  nan],
       [ 30.,  68.,  13.],
       [ nan,  nan,  nan],
       [ 18.,  74.,  59.]])
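
For convenience, the steps can be wrapped in a function. The sketch below is one possible packaging: the name align_rows and the fill parameter are hypothetical and do not come from the answer, and it assumes the identifiers in column 0 of A are unique, as stated in the question. The output is a float array so that NaN can serve as the filler.

```python
import numpy as np

def align_rows(A, B, fill=np.nan):
    """Reorder rows of A to follow the identifier order of B (hypothetical helper).

    Rows of B whose identifier does not occur in A become rows of `fill`.
    Assumes column 0 holds unique identifiers in both arrays.
    """
    a_ids, b_ids = A[:, 0], B[:, 0]
    sidx = a_ids.argsort()
    # Left/right insertion points of B's identifiers in the sorted A identifiers
    left = np.searchsorted(a_ids, b_ids, sorter=sidx)
    right = np.searchsorted(a_ids, b_ids, sorter=sidx, side='right')
    valid = left != right  # True where the identifier exists in A
    out = np.full((B.shape[0], A.shape[1]), fill, dtype=float)
    out[valid] = A[sidx[left[valid]]]
    return out

# Same data as in the sample run
A = np.array([[45, 11, 86],
              [18, 74, 59],
              [30, 68, 13],
              [55, 47, 78]])
B = np.array([[45, 11, 88],
              [55, 83, 46],
              [95, 87, 77],
              [30,  9, 37],
              [14, 97, 98],
              [18, 48, 53]])

out = align_rows(A, B)
print(out)
```

Rows 2 and 4 of the result are filler rows because identifiers 95 and 14 do not occur in A.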
