Sorting a NumPy array on multiple columns in Python
Question
I am trying to sort the following array on column 1, then column 2, and then column 3:
[['2008' '1' '23' 'AAPL' 'Buy' '100']
['2008' '1' '30' 'AAPL' 'Sell' '100']
['2008' '1' '23' 'GOOG' 'Buy' '100']
['2008' '1' '30' 'GOOG' 'Sell' '100']
['2008' '9' '8' 'GOOG' 'Buy' '100']
['2008' '9' '15' 'GOOG' 'Sell' '100']
['2008' '5' '1' 'XOM' 'Buy' '100']
['2008' '5' '8' 'XOM' 'Sell' '100']]
I used the following code:
idx=np.lexsort((order_array[:,2],order_array[:,1],order_array[:,0]))
order_array=order_array[idx]
The resulting array is:
[['2008' '1' '23' 'AAPL' 'Buy' '100']
['2008' '1' '23' 'GOOG' 'Buy' '100']
['2008' '1' '30' 'AAPL' 'Sell' '100']
['2008' '1' '30' 'GOOG' 'Sell' '100']
['2008' '5' '1' 'XOM' 'Buy' '100']
['2008' '5' '8' 'XOM' 'Sell' '100']
['2008' '9' '15' 'GOOG' 'Sell' '100']
['2008' '9' '8' 'GOOG' 'Buy' '100']]
The problem is that the last two rows are in the wrong order: the last row should be the second-to-last one. I have tried everything but cannot understand why this is happening. Any help would be appreciated.
I am using the following code to build order_array:
for i in ….
    x= ldt_timestamps[i] # this is a list of timestamps
    s_sym=……
    list=[int(x.year),int(x.month),int(x.day),s_sym,'Buy',100]
    rows_list.append(list)
order_array=np.array(rows_list)
Recommended answer
tl;dr: NumPy shines when doing numerical calculations on numerical arrays. Although sorting this kind of mixed data is possible (see below), NumPy is not well suited for it. You're probably better off using Pandas.
Cause of the problem:
The values are being sorted as strings. You need to sort them as ints. Strings compare character by character, so '15' sorts before '8':
In [7]: sorted(['15', '8'])
Out[7]: ['15', '8']
In [8]: sorted([15, 8])
Out[8]: [8, 15]
This happened because order_array contains strings. You need to convert those strings to ints where appropriate.
Converting from a string dtype to a numerical dtype requires allocating space for a new array. Therefore, you would probably be better off revising the way you create order_array from the beginning.
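If you are stuck with the string array, a minimal sketch of the conversion could look like the following. It assumes the array from the question; the names year, month, day and sorted_array are only illustrative.

```python
import numpy as np

# The string array from the question.
order_array = np.array([
    ['2008', '1', '23', 'AAPL', 'Buy', '100'],
    ['2008', '1', '30', 'AAPL', 'Sell', '100'],
    ['2008', '1', '23', 'GOOG', 'Buy', '100'],
    ['2008', '1', '30', 'GOOG', 'Sell', '100'],
    ['2008', '9', '8', 'GOOG', 'Buy', '100'],
    ['2008', '9', '15', 'GOOG', 'Sell', '100'],
    ['2008', '5', '1', 'XOM', 'Buy', '100'],
    ['2008', '5', '8', 'XOM', 'Sell', '100'],
])

# astype(int) allocates new integer arrays; order_array itself stays a string array.
year = order_array[:, 0].astype(int)
month = order_array[:, 1].astype(int)
day = order_array[:, 2].astype(int)

# np.lexsort uses the LAST key in the tuple as the primary sort key.
idx = np.lexsort((day, month, year))
sorted_array = order_array[idx]
print(sorted_array[-2:])
```

Only the sort keys are converted here; the rows that come back are still strings, which is one reason to build the data with proper types in the first place.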
Interestingly, even though you converted the values to ints, when you call
order_array = np.array(rows_list)
NumPy by default creates a homogeneous array. In a homogeneous array every value has the same dtype. So NumPy tried to find a common dtype for all your values and chose a string dtype, thwarting the effort you put into converting the strings to ints!
You can check the dtype for yourself by inspecting order_array.dtype:
In [42]: order_array = np.array(rows_list)
In [43]: order_array.dtype
Out[43]: dtype('|S4')
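A small self-contained check of the same coercion, as a sketch with two made-up rows shaped like the ones in the question. On Python 3 the resulting dtype is a Unicode string type rather than the '|S4' shown above, so the exact dtype string will differ.

```python
import numpy as np

# Two illustrative rows mixing ints and strings.
rows_list = [
    [2008, 1, 23, 'AAPL', 'Buy', 100],
    [2008, 9, 8, 'GOOG', 'Buy', 100],
]
mixed = np.array(rows_list)

# Every element has been coerced to a string type, including the ints.
print(mixed.dtype)
print(repr(mixed[0, 0]))
```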
Now, how do we fix this?
Using an object dtype:
The simplest way is to use an 'object' dtype:
In [53]: order_array = np.array(rows_list, dtype='object')
In [54]: order_array
Out[54]:
array([[2008, 1, 23, AAPL, Buy, 100],
[2008, 1, 30, AAPL, Sell, 100],
[2008, 1, 23, GOOG, Buy, 100],
[2008, 1, 30, GOOG, Sell, 100],
[2008, 9, 8, GOOG, Buy, 100],
[2008, 9, 15, GOOG, Sell, 100],
[2008, 5, 1, XOM, Buy, 100],
[2008, 5, 8, XOM, Sell, 100]], dtype=object)
The problem here is that np.lexsort or np.sort do not work on arrays of dtype object. To get around that problem, you could sort rows_list before creating order_array:
In [59]: import operator
In [60]: rows_list.sort(key=operator.itemgetter(0,1,2))  # sorts in place and returns None
In [61]: rows_list
Out[61]:
[(2008, 1, 23, 'AAPL', 'Buy', 100),
(2008, 1, 23, 'GOOG', 'Buy', 100),
(2008, 1, 30, 'AAPL', 'Sell', 100),
(2008, 1, 30, 'GOOG', 'Sell', 100),
(2008, 5, 1, 'XOM', 'Buy', 100),
(2008, 5, 8, 'XOM', 'Sell', 100),
(2008, 9, 8, 'GOOG', 'Buy', 100),
(2008, 9, 15, 'GOOG', 'Sell', 100)]
In [62]: order_array = np.array(rows_list, dtype='object')
A better option would be to combine the first three columns into datetime.date objects:
import operator
import datetime as DT
for i in ...:
    # x and s_sym are obtained as in the question's loop
    seq = [DT.date(int(x.year), int(x.month), int(x.day)), s_sym, 'Buy', 100]
    rows_list.append(seq)
# columns 0, 1, 2 are now date, symbol, action
rows_list.sort(key=operator.itemgetter(0,1,2))
order_array = np.array(rows_list, dtype='object')
In [72]: order_array
Out[72]:
array([[2008-01-23, AAPL, Buy, 100],
[2008-01-30, AAPL, Sell, 100],
[2008-01-23, GOOG, Buy, 100],
[2008-01-30, GOOG, Sell, 100],
[2008-09-08, GOOG, Buy, 100],
[2008-09-15, GOOG, Sell, 100],
[2008-05-01, XOM, Buy, 100],
[2008-05-08, XOM, Sell, 100]], dtype=object)
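For a version that runs end to end, here is a sketch in which a hypothetical raw_rows list stands in for the elided loop over ldt_timestamps. The rows are sorted as a Python list first and only then turned into an object array.

```python
import datetime as DT
import operator

import numpy as np

# Hypothetical stand-in for the question's loop:
# (year, month, day, symbol, action, value)
raw_rows = [
    (2008, 1, 23, 'AAPL', 'Buy', 100),
    (2008, 1, 30, 'AAPL', 'Sell', 100),
    (2008, 1, 23, 'GOOG', 'Buy', 100),
    (2008, 1, 30, 'GOOG', 'Sell', 100),
    (2008, 9, 8, 'GOOG', 'Buy', 100),
    (2008, 9, 15, 'GOOG', 'Sell', 100),
    (2008, 5, 1, 'XOM', 'Buy', 100),
    (2008, 5, 8, 'XOM', 'Sell', 100),
]

# Collapse year/month/day into a single datetime.date per row.
rows_list = [[DT.date(y, m, d), sym, action, value]
             for y, m, d, sym, action, value in raw_rows]

# Sort by date, then symbol, then action, before building the array.
rows_list.sort(key=operator.itemgetter(0, 1, 2))
order_array = np.array(rows_list, dtype='object')
print(order_array[:, 0])
```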
Even though this is simple, I don't like NumPy arrays of dtype object. You get neither the speed nor the memory savings of NumPy arrays with native dtypes. At this point you might find working with a Python list of lists faster and syntactically easier to deal with.
Using a structured array:
A more NumPy-ish solution which still offers speed and memory benefits is to use a structured array (as opposed to a homogeneous array). To make a structured array with np.array you'll need to supply a dtype explicitly:
dt = [('year', '<i4'), ('month', '<i4'), ('day', '<i4'), ('symbol', '|S8'),
('action', '|S4'), ('value', '<i4')]
order_array = np.array(rows_list, dtype=dt)
In [47]: order_array.dtype
Out[47]: dtype([('year', '<i4'), ('month', '<i4'), ('day', '<i4'), ('symbol', '|S8'), ('action', '|S4'), ('value', '<i4')])
To sort the structured array you could use the sort method:
order_array.sort(order=['year', 'month', 'day'])
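Put together, a runnable sketch of the structured-array approach might look like this. Two things here are my assumptions rather than part of the answer: the rows are written out as tuples (np.array expects one tuple per record when given a structured dtype), and the string fields use 'U8'/'U4' instead of '|S8'/'|S4' so that they hold str rather than bytes on Python 3.

```python
import numpy as np

# One tuple per record; the data is taken from the question.
rows = [
    (2008, 1, 23, 'AAPL', 'Buy', 100),
    (2008, 1, 30, 'AAPL', 'Sell', 100),
    (2008, 1, 23, 'GOOG', 'Buy', 100),
    (2008, 1, 30, 'GOOG', 'Sell', 100),
    (2008, 9, 8, 'GOOG', 'Buy', 100),
    (2008, 9, 15, 'GOOG', 'Sell', 100),
    (2008, 5, 1, 'XOM', 'Buy', 100),
    (2008, 5, 8, 'XOM', 'Sell', 100),
]

# 'U8' / 'U4' are Unicode string fields (assumed here for Python 3).
dt = [('year', '<i4'), ('month', '<i4'), ('day', '<i4'),
      ('symbol', 'U8'), ('action', 'U4'), ('value', '<i4')]
order_array = np.array(rows, dtype=dt)

# In-place sort: year first, then month, then day, compared as integers.
order_array.sort(order=['year', 'month', 'day'])
print(order_array)
```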
To work with structured arrays, you'll need to know about some differences between homogeneous and structured arrays:
Your original homogeneous array was 2-dimensional. In contrast, this structured array is 1-dimensional, with one record per row:
In [51]: order_array.shape
Out[51]: (8,)
If you index the structured array with an int or iterate through the array, you get back rows:
In [52]: order_array[3]
Out[52]: (2008, 1, 30, 'GOOG', 'Sell', 100)
With homogeneous arrays you can access the columns with order_array[:, i]. Now, with a structured array, you access them by name, e.g. order_array['year'].
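The access patterns above can be tried on a tiny structured array. This is only a sketch: the three-row array arr and its shortened dtype are made up for illustration.

```python
import numpy as np

dt = [('year', '<i4'), ('month', '<i4'), ('day', '<i4'), ('symbol', 'U8')]
arr = np.array([(2008, 1, 23, 'AAPL'),
                (2008, 5, 1, 'XOM'),
                (2008, 9, 8, 'GOOG')], dtype=dt)

first_row = arr[0]             # integer index: a single record
years = arr['year']            # field name: one column as a plain int array
keys = arr[['month', 'day']]   # list of names: still a structured array
later = arr[arr['month'] > 1]  # boolean mask built from one column

print(arr.shape, first_row, years, later['symbol'])
```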
Or use Pandas:
If you can install Pandas, I think you might be happiest working with a Pandas DataFrame:
In [73]: df = pd.DataFrame(rows_list, columns=['date', 'symbol', 'action', 'value'])
In [75]: df.sort_values(['date'])  # DataFrame.sort was removed from pandas; use sort_values
Out[75]:
date symbol action value
0 2008-01-23 AAPL Buy 100
2 2008-01-23 GOOG Buy 100
1 2008-01-30 AAPL Sell 100
3 2008-01-30 GOOG Sell 100
6 2008-05-01 XOM Buy 100
7 2008-05-08 XOM Sell 100
4 2008-09-08 GOOG Buy 100
5 2008-09-15 GOOG Sell 100
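A self-contained version of the DataFrame approach, as a sketch assuming a recent pandas and rows already holding datetime.date objects. It also sorts on symbol, which is my addition so that rows sharing a date come out in a fixed order.

```python
import datetime as DT

import pandas as pd

# Rows in the question's original order, with the date already combined.
rows_list = [
    [DT.date(2008, 1, 23), 'AAPL', 'Buy', 100],
    [DT.date(2008, 1, 30), 'AAPL', 'Sell', 100],
    [DT.date(2008, 1, 23), 'GOOG', 'Buy', 100],
    [DT.date(2008, 1, 30), 'GOOG', 'Sell', 100],
    [DT.date(2008, 9, 8), 'GOOG', 'Buy', 100],
    [DT.date(2008, 9, 15), 'GOOG', 'Sell', 100],
    [DT.date(2008, 5, 1), 'XOM', 'Buy', 100],
    [DT.date(2008, 5, 8), 'XOM', 'Sell', 100],
]
df = pd.DataFrame(rows_list, columns=['date', 'symbol', 'action', 'value'])

# sort_values returns a new, sorted DataFrame and keeps the original index labels.
sorted_df = df.sort_values(['date', 'symbol'])
print(sorted_df)
```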
Pandas has useful functions for aligning time series by date, filling in missing values, and grouping and aggregating/transforming rows or columns.
Typically it is more useful to have a single date column instead of three integer-valued columns for the year, month, and day.
If you need the year, month, and day as separate columns for output to, say, CSV, then you can replace the date column with year, month, day columns like this:
In [33]: df = df.join(df['date'].apply(lambda x: pd.Series([x.year, x.month, x.day], index=['year', 'month', 'day'])))
In [34]: del df['date']
In [35]: df
Out[35]:
symbol action value year month day
0 AAPL Buy 100 2008 1 23
1 GOOG Buy 100 2008 1 23
2 AAPL Sell 100 2008 1 30
3 GOOG Sell 100 2008 1 30
4 XOM Buy 100 2008 5 1
5 XOM Sell 100 2008 5 8
6 GOOG Buy 100 2008 9 8
7 GOOG Sell 100 2008 9 15
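An alternative way to do the same split, offered as a sketch rather than as what the answer used: convert the column with pd.to_datetime and read the parts through the .dt accessor. The two-row DataFrame is made up for illustration.

```python
import datetime as DT

import pandas as pd

df = pd.DataFrame(
    [[DT.date(2008, 1, 23), 'AAPL', 'Buy', 100],
     [DT.date(2008, 9, 15), 'GOOG', 'Sell', 100]],
    columns=['date', 'symbol', 'action', 'value'],
)

# Convert to a datetime64 Series so the .dt accessor is available.
dates = pd.to_datetime(df['date'])
df['year'] = dates.dt.year
df['month'] = dates.dt.month
df['day'] = dates.dt.day

# Drop the combined column once the parts have been extracted.
df = df.drop(columns='date')
print(df)
```

This avoids building one small Series per row, which the apply-based version does.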
Or, if you have no use for the 'date' column to begin with, you can of course leave rows_list alone and build the DataFrame with the year, month, day columns from the beginning. Sorting is still easy:
df.sort_values(['year', 'month', 'day'])