Sorting array of objects by row using custom dtype


Question

I am attempting to sort some arrays lexicographically by rows. The integer case works perfectly:

>>> import numpy as np
>>> arr = np.random.choice(10, size=(5, 3))
>>> arr
array([[1, 0, 2],
       [8, 0, 8],
       [1, 8, 4],
       [1, 3, 9],
       [6, 1, 8]])
>>> np.ndarray(arr.shape[0], dtype=[('', arr.dtype, arr.shape[1])], buffer=arr).sort()
>>> arr
array([[1, 0, 2],
       [1, 3, 9],
       [1, 8, 4],
       [6, 1, 8],
       [8, 0, 8]])

I can also do the sorting with

np.ndarray(arr.shape[0], dtype=[('', arr.dtype)] * arr.shape[1], buffer=arr).sort()

In both cases, the results are the same. However, that is not the case for object arrays:

>>> import string
>>> selection = np.array(list(string.ascii_lowercase), dtype=object)
>>> arr = np.random.choice(selection, size=(5, 3))
>>> arr
array([['t', 'p', 'g'],
       ['n', 's', 'd'],
       ['g', 'g', 'n'],
       ['g', 'h', 'o'],
       ['f', 'j', 'x']], dtype=object)
>>> np.ndarray(arr.shape[0], dtype=[('', arr.dtype, arr.shape[1])], buffer=arr).sort()
>>> arr
array([['t', 'p', 'g'],
       ['n', 's', 'd'],
       ['g', 'h', 'o'],
       ['g', 'g', 'n'],
       ['f', 'j', 'x']], dtype=object)
>>> np.ndarray(arr.shape[0], dtype=[('', arr.dtype)] * arr.shape[1], buffer=arr).sort()
>>> arr
array([['f', 'j', 'x'],
       ['g', 'g', 'n'],
       ['g', 'h', 'o'],
       ['n', 's', 'd'],
       ['t', 'p', 'g']], dtype=object)

Clearly only the case with dtype=[('', arr.dtype)] * arr.shape[1] is working properly. Why is that? What is different about dtype=[('', arr.dtype, arr.shape[1])]? The sort is clearly doing something, but the order appears to be nonsensical at first glance. Is it using pointers as the sort keys?

For what it's worth, np.searchsorted appears to be doing the same sort of comparison as np.sort, as expected.

Solution

The fact that sorting works for integers is a coincidence. This can be verified by looking at the result of the same operation on floating-point values:

>>> arr = np.array([[0.5, 1.0, 10.2],
                    [0.4, 2.0, 11.0],
                    [1.0, 2.0, 4.0]])
>>> np.sort(np.ndarray(arr.shape[0], dtype=[('', arr.dtype, arr.shape[1])], buffer=arr))
array([([ 0.5,  1. , 10.2],),
       ([ 1. ,  2. ,  4. ],),
       ([ 0.4,  2. , 11. ],)], dtype=[('f0', '<f8', (3,))])
>>> np.sort(np.ndarray(arr.shape[0], dtype=[('', arr.dtype)] * arr.shape[1], buffer=arr))
array([(0.4, 2., 11. ),
       (0.5, 1., 10.2),
       (1. , 2.,  4. )],
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8')])
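As a cross-check that the second result really is the lexicographic row order, the same rows can be sorted without any structured dtype. The following is a minimal sketch, not part of the original answer: it assumes `np.lexsort` is an acceptable substitute and uses Python's tuple comparison as an independent reference. The names `order`, `by_lexsort` and `by_python` are illustrative only.

```python
import numpy as np

arr = np.array([[0.5, 1.0, 10.2],
                [0.4, 2.0, 11.0],
                [1.0, 2.0, 4.0]])

# np.lexsort treats the LAST key as the primary one, so the columns are
# passed in reverse order: column 0 becomes the primary sort key.
order = np.lexsort(arr.T[::-1])
by_lexsort = arr[order]

# Independent reference: Python compares tuples element by element,
# which is exactly a lexicographic comparison of the rows.
by_python = sorted(map(tuple, arr.tolist()))

print(by_lexsort)
print(by_python)
```

Both approaches compare the actual floating-point values, so they agree with the one-field-per-column dtype rather than with the subarray dtype.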

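Another hint comes from looking at the bits of the numbers 0.5, 0.4 and 1.0:

0.5 = 0x3FE0000000000000
0.4 = 0x3FD999999999999A
1.0 = 0x3FF0000000000000

On a little-endian machine the bytes are stored in reverse order, so the last byte shown above comes first. Compared byte by byte, 0.5 and 1.0 both start with 0x00, while 0.4 starts with 0x9A and therefore sorts last. The tie between 0.5 and 1.0 is only broken at the seventh byte, where 0xE0 < 0xF0. That gives the order 0.5, 1.0, 0.4 seen in the first result above.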
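The byte layout can be inspected directly with the standard library. This is a small sketch rather than anything NumPy does internally: it packs each value as a little-endian IEEE-754 double and sorts the values by their raw bytes, assuming an unsigned byte-by-byte comparison (which is how Python orders `bytes` objects). The names `raw` and `byte_order` are illustrative.

```python
import struct

values = [0.5, 0.4, 1.0]

# "<d" packs a float as a little-endian IEEE-754 double (8 bytes).
raw = {v: struct.pack("<d", v) for v in values}

for v, b in raw.items():
    print(v, b.hex())

# Python compares bytes objects lexicographically as unsigned bytes,
# which models a byte-by-byte comparison of the in-memory representation.
byte_order = sorted(values, key=raw.get)
print(byte_order)
```

The resulting order, 0.5, 1.0, 0.4, matches the first column of the rows returned for the subarray dtype.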

The exact answer can be verified by looking at the sorting functions in the source code. For example, in quicksort.c.src, we see that all types that are not explicitly numerical (including structure fields that are not scalars) are handled by the generic npy_quicksort function. It uses the function cmp as a comparator and the macros GENERIC_SWAP and GENERIC_COPY to swap and copy, respectively.

The function cmp is defined as PyArray_DESCR(arr)->f->compare. The macros are defined as element-wise operations in npysort_common.h.

So the final result is that for any non-scalar type, including packed array structure fields, the comparison is done byte by byte. For objects, this will of course be the numeric values of the pointers. For floats, this will be the IEEE-754 representation. The fact that the small non-negative integers in the example appear to sort correctly is a side effect of my little-endian platform: each value fits in the first byte, and the remaining bytes are all zero. Negative integers stored in two's complement form would likely not yield correct results.
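The integer case can be illustrated without NumPy. The sketch below is an assumption-laden model, not NumPy's own code: it encodes Python integers as 8-byte little-endian two's complement values (as an int64 would be stored on such a platform) and sorts them by their raw bytes. The helper `le_bytes` is hypothetical and exists only for this illustration.

```python
def le_bytes(n: int) -> bytes:
    """Encode n as an 8-byte little-endian two's complement integer."""
    return n.to_bytes(8, "little", signed=True)

# Values that fit in one byte: only the first byte differs, so the
# byte-wise order happens to agree with the numeric order.
small = [8, 1, 6]
small_by_bytes = sorted(small, key=le_bytes)

# Larger or negative values: 256 is stored as 00 01 00 ..., 1 as 01 00 ...,
# and -1 as ff ff ..., so the byte-wise order no longer matches.
mixed = [256, 1, -1]
mixed_by_bytes = sorted(mixed, key=le_bytes)

print(small_by_bytes, sorted(small))
print(mixed_by_bytes, sorted(mixed))
```

Under this model the values 0 to 9 used in the question sort correctly only because nothing beyond the first byte ever differs.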
