Sorting numpy array on multiple columns in Python


Problem description

I am trying to sort the following array on column1, then column2 and then column3

[['2008' '1' '23' 'AAPL' 'Buy' '100']
 ['2008' '1' '30' 'AAPL' 'Sell' '100']
 ['2008' '1' '23' 'GOOG' 'Buy' '100']
 ['2008' '1' '30' 'GOOG' 'Sell' '100']
 ['2008' '9' '8' 'GOOG' 'Buy' '100']
 ['2008' '9' '15' 'GOOG' 'Sell' '100']
 ['2008' '5' '1' 'XOM' 'Buy' '100']
 ['2008' '5' '8' 'XOM' 'Sell' '100']]

I used the following code:

    idx=np.lexsort((order_array[:,2],order_array[:,1],order_array[:,0]))
    order_array=order_array[idx]

The resulting array is:

[['2008' '1' '23' 'AAPL' 'Buy' '100']
 ['2008' '1' '23' 'GOOG' 'Buy' '100']
 ['2008' '1' '30' 'AAPL' 'Sell' '100']
 ['2008' '1' '30' 'GOOG' 'Sell' '100']
 ['2008' '5' '1' 'XOM' 'Buy' '100']
 ['2008' '5' '8' 'XOM' 'Sell' '100']
 ['2008' '9' '15' 'GOOG' 'Sell' '100']
 ['2008' '9' '8' 'GOOG' 'Buy' '100']]

The problem is that the last two rows are wrong. The correct array should have the last row as the second last one. I have tried everything but am not able to understand why this is happening. Will appreciate some help.

I am using the following code for obtaining order_array.

 for i in ….
    x= ldt_timestamps[i] # this is a list of timestamps
    s_sym=……
    list=[int(x.year),int(x.month),int(x.day),s_sym,'Buy',100]   
    rows_list.append(list) 

 order_array=np.array(rows_list)

Recommended answer

tldr: NumPy shines when doing numerical calculations on numerical arrays. Although it is possible (see below), NumPy is not well suited for this. You're probably better off using Pandas.

Cause of the problem:

The values are being sorted as strings. You need to sort them as ints.

In [7]: sorted(['15', '8'])
Out[7]: ['15', '8']

In [8]: sorted([15, 8])
Out[8]: [8, 15]

This happened because order_array contains strings. You need to convert those strings to ints where appropriate.
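
As a rough sketch of that conversion applied to the lexsort call from the question: the three sample rows are taken from the question, and `.astype(int)` on each key column is the only change. The name `sorted_array` is introduced here for illustration.

```python
import numpy as np

# A string array like the one in the question (three rows are enough here).
order_array = np.array([
    ['2008', '9', '15', 'GOOG', 'Sell', '100'],
    ['2008', '9', '8', 'GOOG', 'Buy', '100'],
    ['2008', '1', '23', 'AAPL', 'Buy', '100'],
])

# Convert each key column to int so that 8 sorts before 15.
# np.lexsort treats the LAST key as the primary one: day, month, year.
idx = np.lexsort((order_array[:, 2].astype(int),
                  order_array[:, 1].astype(int),
                  order_array[:, 0].astype(int)))
sorted_array = order_array[idx]
print(sorted_array)
```

The array itself still holds strings; only the sort keys are numeric, so every later numeric operation would need the same conversion.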

Converting dtypes from string-dtype to numerical dtype requires allocating space for a new array. Therefore, you would probably be better off revising the way you are creating order_array from the beginning.

Interestingly, even though you converted the values to ints, when you call

order_array = np.array(rows_list)

NumPy by default creates a homogeneous array. In a homogeneous array every value has the same dtype. So NumPy tried to find the common denominator among all your values and chose a string dtype, thwarting the effort you put into converting the strings to ints!

You can check the dtype for yourself by inspecting order_array.dtype:

In [42]: order_array = np.array(rows_list)

In [43]: order_array.dtype
Out[43]: dtype('|S4')
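
The `'|S4'` above is what a Python 2 session reports. A small sketch of the same check on Python 3, where the common dtype is expected to be a Unicode string type instead (the two sample rows are made up to match the question's layout, and the exact string width may differ between NumPy versions):

```python
import numpy as np

rows_list = [[2008, 1, 23, 'AAPL', 'Buy', 100],
             [2008, 9, 8, 'GOOG', 'Buy', 100]]

order_array = np.array(rows_list)

# kind 'U' means Unicode string: the ints were converted to text.
print(order_array.dtype, order_array.dtype.kind)
print(repr(order_array[0, 2]))  # the day is now a string, not an int
```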

Now, how do we fix this?

Using an object dtype:

The simplest way is to use an 'object' dtype:

In [53]: order_array = np.array(rows_list, dtype='object')

In [54]: order_array
Out[54]: 
array([[2008, 1, 23, AAPL, Buy, 100],
       [2008, 1, 30, AAPL, Sell, 100],
       [2008, 1, 23, GOOG, Buy, 100],
       [2008, 1, 30, GOOG, Sell, 100],
       [2008, 9, 8, GOOG, Buy, 100],
       [2008, 9, 15, GOOG, Sell, 100],
       [2008, 5, 1, XOM, Buy, 100],
       [2008, 5, 8, XOM, Sell, 100]], dtype=object)

The problem here is that np.lexsort or np.sort do not work on arrays of dtype object. To get around that problem, you could sort the rows_list before creating order_array:

In [59]: import operator

In [60]: rows_list.sort(key=operator.itemgetter(0,1,2))

In [61]: rows_list
Out[61]: 
[(2008, 1, 23, 'AAPL', 'Buy', 100),
 (2008, 1, 23, 'GOOG', 'Buy', 100),
 (2008, 1, 30, 'AAPL', 'Sell', 100),
 (2008, 1, 30, 'GOOG', 'Sell', 100),
 (2008, 5, 1, 'XOM', 'Buy', 100),
 (2008, 5, 8, 'XOM', 'Sell', 100),
 (2008, 9, 8, 'GOOG', 'Buy', 100),
 (2008, 9, 15, 'GOOG', 'Sell', 100)]

order_array = np.array(rows_list, dtype='object')
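
If the object array has already been built, one possible workaround (a sketch, not part of the original answer) is to compute the row order in plain Python and index the array with it. The sample rows and the names `idx` and `sorted_array` are illustrative.

```python
import numpy as np

order_array = np.array([[2008, 9, 15, 'GOOG', 'Sell', 100],
                        [2008, 1, 30, 'AAPL', 'Sell', 100],
                        [2008, 9, 8, 'GOOG', 'Buy', 100]], dtype=object)

# Row order from Python's sorted(), keyed on (year, month, day).
idx = sorted(range(len(order_array)),
             key=lambda i: tuple(order_array[i, :3]))

# Indexing with a list of ints selects whole rows in that order.
sorted_array = order_array[idx]
print(sorted_array)
```

This keeps the ints as Python ints, but the sorting runs at Python speed rather than NumPy speed.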

A better option would be to combine the first three columns into datetime.date objects:

import operator
import datetime as DT

for i in ...:
    # x and s_sym are obtained as in the question's loop
    seq = [DT.date(int(x.year), int(x.month), int(x.day)), s_sym, 'Buy', 100]
    rows_list.append(seq)
# column 0 now holds the date, so this sorts by date, then symbol, then action
rows_list.sort(key=operator.itemgetter(0,1,2))
order_array = np.array(rows_list, dtype='object')

In [72]: order_array
Out[72]: 
array([[2008-01-23, AAPL, Buy, 100],
       [2008-01-23, GOOG, Buy, 100],
       [2008-01-30, AAPL, Sell, 100],
       [2008-01-30, GOOG, Sell, 100],
       [2008-05-01, XOM, Buy, 100],
       [2008-05-08, XOM, Sell, 100],
       [2008-09-08, GOOG, Buy, 100],
       [2008-09-15, GOOG, Sell, 100]], dtype=object)
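
A self-contained sketch of the date-based approach, using only the standard library. The `raw_rows` records are hypothetical stand-ins for whatever the question's loop produces.

```python
import datetime as DT
import operator

# Hypothetical (year, month, day, symbol, action, value) records.
raw_rows = [(2008, 9, 15, 'GOOG', 'Sell', 100),
            (2008, 1, 23, 'AAPL', 'Buy', 100),
            (2008, 9, 8, 'GOOG', 'Buy', 100)]

# Fold year, month and day into a single datetime.date per row.
rows_list = [[DT.date(y, m, d), sym, action, value]
             for y, m, d, sym, action, value in raw_rows]

# date objects compare chronologically, so one key column is enough.
rows_list.sort(key=operator.itemgetter(0))

for row in rows_list:
    print(row[0].isoformat(), row[1], row[2], row[3])
```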

Even though this is simple, I don't like NumPy arrays of dtype object. You get neither the speed nor the memory space-saving gains of NumPy arrays with native dtypes. At this point you might find working with a Python list of lists faster and syntactically easier to deal with.

Using a structured array:

A more NumPy-ish solution which still offers speed and memory benefits is to use a structured array (as opposed to a homogeneous array). To make a structured array with np.array you'll need to supply a dtype explicitly:

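dt = [('year', '<i4'), ('month', '<i4'), ('day', '<i4'), ('symbol', '|S8'),
      ('action', '|S4'), ('value', '<i4')]
# each row must be a tuple (not a list) to be read as one structured record
order_array = np.array([tuple(row) for row in rows_list], dtype=dt)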

In [47]: order_array.dtype
Out[47]: dtype([('year', '<i4'), ('month', '<i4'), ('day', '<i4'), ('symbol', '|S8'), ('action', '|S4'), ('value', '<i4')])

To sort the structured array you could use the sort method:

order_array.sort(order=['year', 'month', 'day'])
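
Putting the structured-array steps together, a minimal runnable sketch might look like this. It assumes Python 3, so the string fields are declared as Unicode (`'U8'`, `'U4'`) instead of the byte-string `'|S8'`, `'|S4'` used above; the sample rows are illustrative.

```python
import numpy as np

rows_list = [[2008, 9, 15, 'GOOG', 'Sell', 100],
             [2008, 1, 23, 'AAPL', 'Buy', 100],
             [2008, 9, 8, 'GOOG', 'Buy', 100]]

# Assumption: Unicode string fields so values come back as str on Python 3.
dt = [('year', '<i4'), ('month', '<i4'), ('day', '<i4'),
      ('symbol', 'U8'), ('action', 'U4'), ('value', '<i4')]

# Each record has to be a tuple for NumPy to treat it as one row.
order_array = np.array([tuple(row) for row in rows_list], dtype=dt)

# In-place sort by year, then month, then day (compared as integers).
order_array.sort(order=['year', 'month', 'day'])
print(order_array)
```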


To work with structured arrays, you'll need to know about some differences between homogeneous and structured arrays:

Your original homogeneous array was 2-dimensional. In contrast, all structured arrays are 1-dimensional:

In [51]: order_array.shape
Out[51]: (8,)

If you index the structured array with an int or iterate through the array, you get back rows:

In [52]: order_array[3]
Out[52]: (2008, 1, 30, 'GOOG', 'Sell', 100)

With homogeneous arrays you can access the columns with order_array[:, i]. Now, with a structured array, you access them by name, e.g. order_array['year'].
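
These differences can be seen side by side in a short sketch. The dtype is trimmed to four fields and the two records are made up for illustration.

```python
import numpy as np

dt = [('year', '<i4'), ('month', '<i4'), ('day', '<i4'), ('symbol', 'U8')]
order_array = np.array([(2008, 1, 23, 'AAPL'),
                        (2008, 9, 8, 'GOOG')], dtype=dt)

print(order_array.shape)     # one dimension: one entry per record
print(order_array[1])        # integer index: a whole row
print(order_array['day'])    # field name: a whole column

# A field of a single row can be reached by chaining the two.
print(order_array[1]['symbol'])
```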

Or use Pandas:

If you can install Pandas, I think you might be happiest working with a Pandas DataFrame:

In [73]: df = pd.DataFrame(rows_list, columns=['date', 'symbol', 'action', 'value'])
In [75]: df.sort_values(['date'])
Out[75]: 
         date symbol action  value
0  2008-01-23   AAPL    Buy    100
2  2008-01-23   GOOG    Buy    100
1  2008-01-30   AAPL   Sell    100
3  2008-01-30   GOOG   Sell    100
6  2008-05-01    XOM    Buy    100
7  2008-05-08    XOM   Sell    100
4  2008-09-08   GOOG    Buy    100
5  2008-09-15   GOOG   Sell    100
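
A self-contained sketch of the same idea for current pandas releases, where sorting by column values is done with `sort_values` (the older `DataFrame.sort` method has been removed). The three sample rows are illustrative.

```python
import datetime as DT
import pandas as pd

rows_list = [[DT.date(2008, 9, 15), 'GOOG', 'Sell', 100],
             [DT.date(2008, 1, 23), 'AAPL', 'Buy', 100],
             [DT.date(2008, 9, 8), 'GOOG', 'Buy', 100]]
df = pd.DataFrame(rows_list, columns=['date', 'symbol', 'action', 'value'])

# sort_values returns a new DataFrame and keeps the original index labels.
sorted_df = df.sort_values('date')
print(sorted_df)
```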

Pandas has useful functions for aligning time series by dates, filling in missing values, grouping and aggregating/transforming rows or columns.

Typically it is more useful to have a single date column instead of three integer-valued columns for the year, month, day.

If you need the year, month, day as separate columns for the purpose of outputting to, say, CSV, then you can replace the date column with year, month, day columns like this:

In [33]: df = df.join(df['date'].apply(lambda x: pd.Series([x.year, x.month, x.day], index=['year', 'month', 'day'])))

In [34]: del df['date']

In [35]: df
Out[35]: 
  symbol action  value  year  month  day
0   AAPL    Buy    100  2008      1   23
1   GOOG    Buy    100  2008      1   23
2   AAPL   Sell    100  2008      1   30
3   GOOG   Sell    100  2008      1   30
4    XOM    Buy    100  2008      5    1
5    XOM   Sell    100  2008      5    8
6   GOOG    Buy    100  2008      9    8
7   GOOG   Sell    100  2008      9   15

Or, if you have no use for the 'date' column to begin with, you can of course leave rows_list alone and build the DataFrame with the year, month, day columns from the beginning. Sorting is still easy:

df.sort_values(['year', 'month', 'day'])
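
For completeness, a sketch of the separate-columns variant built straight from a list of rows. The sample rows are illustrative, and `reset_index(drop=True)` is an optional extra that renumbers the rows after sorting.

```python
import pandas as pd

rows_list = [[2008, 9, 15, 'GOOG', 'Sell', 100],
             [2008, 1, 23, 'AAPL', 'Buy', 100],
             [2008, 9, 8, 'GOOG', 'Buy', 100]]
df = pd.DataFrame(rows_list,
                  columns=['year', 'month', 'day', 'symbol', 'action', 'value'])

# The integer columns are compared numerically, so day 8 precedes day 15.
sorted_df = df.sort_values(['year', 'month', 'day']).reset_index(drop=True)
print(sorted_df)
```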
