Multi-Index Sorting in Pandas
Question
I have a multi-index DataFrame created via a groupby operation. I'm trying to do a compound sort using several levels of the index, but I can't seem to find a sort function that does what I need.
The initial dataset looks something like this (daily sales counts of various products):
         Date Manufacturer Product Name Product Launch Date  Sales
0  2013-01-01        Apple         iPod          2001-10-23     12
1  2013-01-01        Apple         iPad          2010-04-03     13
2  2013-01-01      Samsung       Galaxy          2009-04-27     14
3  2013-01-01      Samsung   Galaxy Tab          2010-09-02     15
4  2013-01-02        Apple         iPod          2001-10-23     22
5  2013-01-02        Apple         iPad          2010-04-03     17
6  2013-01-02      Samsung       Galaxy          2009-04-27     10
7  2013-01-02      Samsung   Galaxy Tab          2010-09-02      7
I use groupby to get a sum over the date range:
> grouped = df.groupby(['Manufacturer', 'Product Name', 'Product Launch Date']).sum()
                                               Sales
Manufacturer Product Name Product Launch Date
Apple        iPad         2010-04-03              30
             iPod         2001-10-23              34
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22
So far so good!
Now the last thing I want to do is sort each manufacturer's products by launch date, but keep them grouped hierarchically under Manufacturer - here's all I am trying to do:
                                               Sales
Manufacturer Product Name Product Launch Date
Apple        iPod         2001-10-23              34
             iPad         2010-04-03              30
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22
When I try sortlevel() I lose the nice per-company hierarchy I had before:
> grouped.sortlevel('Product Launch Date')
                                               Sales
Manufacturer Product Name Product Launch Date
Apple        iPod         2001-10-23              34
Samsung      Galaxy       2009-04-27              24
Apple        iPad         2010-04-03              30
Samsung      Galaxy Tab   2010-09-02              22
sort() and sort_index() just fail:
grouped.sort(['Manufacturer','Product Launch Date'])
KeyError: u'no item named Manufacturer'
grouped.sort_index(by=['Manufacturer','Product Launch Date'])
KeyError: u'no item named Manufacturer'
Seems like a simple operation, but I can't quite figure it out.
I'm not tied to using a MultiIndex for this, but since that's what groupby() returns, that's what I've been working with.
BTW the code to produce the initial DataFrame is:
from pandas import DataFrame

data = {
    'Date': ['2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01', '2013-01-02', '2013-01-02', '2013-01-02', '2013-01-02'],
    'Manufacturer': ['Apple', 'Apple', 'Samsung', 'Samsung', 'Apple', 'Apple', 'Samsung', 'Samsung'],
    'Product Name': ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab', 'iPod', 'iPad', 'Galaxy', 'Galaxy Tab'],
    'Product Launch Date': ['2001-10-23', '2010-04-03', '2009-04-27', '2010-09-02', '2001-10-23', '2010-04-03', '2009-04-27', '2010-09-02'],
    'Sales': [12, 13, 14, 15, 22, 17, 10, 7]
}
df = DataFrame(data, columns=['Date', 'Manufacturer', 'Product Name', 'Product Launch Date', 'Sales'])
Recommended Answer
A hack would be to change the order of the levels:
In [11]: g
Out[11]:
                                               Sales
Manufacturer Product Name Product Launch Date
Apple        iPad         2010-04-03              30
             iPod         2001-10-23              34
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22
In [12]: g.index = g.index.swaplevel(1, 2)
Then call sortlevel, which (as you've found) sorts the MultiIndex levels in order:
In [13]: g = g.sortlevel()
And swap back:
In [14]: g.index = g.index.swaplevel(1, 2)
In [15]: g
Out[15]:
                                               Sales
Manufacturer Product Name Product Launch Date
Apple        iPod         2001-10-23              34
             iPad         2010-04-03              30
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22
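For readers on a recent pandas release, where sortlevel() is no longer available, the same swap, sort, swap-back idea can be sketched with sort_index(). This is only an illustration under stated assumptions: the 'Date' column is left out of the sample data so that only 'Sales' is summed, and the name `result` is introduced here for the example. The launch dates are ISO-formatted strings, so sorting them as text matches chronological order.

```python
import pandas as pd

# Sample data from the question, minus the 'Date' column
# (an illustrative simplification so that only 'Sales' is summed).
df = pd.DataFrame({
    'Manufacturer': ['Apple', 'Apple', 'Samsung', 'Samsung'] * 2,
    'Product Name': ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab'] * 2,
    'Product Launch Date': ['2001-10-23', '2010-04-03',
                            '2009-04-27', '2010-09-02'] * 2,
    'Sales': [12, 13, 14, 15, 22, 17, 10, 7],
})

g = df.groupby(['Manufacturer', 'Product Name', 'Product Launch Date']).sum()

# 1. Swap levels 1 and 2 so 'Product Launch Date' precedes 'Product Name'.
# 2. Sort the index lexicographically, level by level.
# 3. Swap the same two levels back to restore the original level order.
result = g.swaplevel(1, 2).sort_index().swaplevel(1, 2)
print(result)
```

Chaining the three calls avoids assigning to g.index in place, which leaves the original grouped frame untouched.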
I'm of the opinion that sortlevel should not sort the remaining labels in order, so I will create a GitHub issue. :) That said, it's worth mentioning the doc note about "the need for sortedness".
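In current pandas, whether the remaining levels are also sorted is controlled by the sort_remaining parameter of sort_index(), and the levels to sort by can be named directly. The sketch below assumes a recent pandas release and, as a simplification, omits the 'Date' column from the sample data. The names `by_level` and `no_remaining` are made up for this example.

```python
import pandas as pd

# Sample data from the question, minus the 'Date' column
# (an illustrative simplification so that only 'Sales' is summed).
df = pd.DataFrame({
    'Manufacturer': ['Apple', 'Apple', 'Samsung', 'Samsung'] * 2,
    'Product Name': ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab'] * 2,
    'Product Launch Date': ['2001-10-23', '2010-04-03',
                            '2009-04-27', '2010-09-02'] * 2,
    'Sales': [12, 13, 14, 15, 22, 17, 10, 7],
})

g = df.groupby(['Manufacturer', 'Product Name', 'Product Launch Date']).sum()

# Sort by manufacturer first, then by launch date. With the default
# sort_remaining=True, ties are broken by the levels not listed
# (here 'Product Name').
by_level = g.sort_index(level=['Manufacturer', 'Product Launch Date'])

# With sort_remaining=False, only the listed levels take part in the sort.
no_remaining = g.sort_index(level=['Manufacturer', 'Product Launch Date'],
                            sort_remaining=False)
print(by_level)
```

Because every (manufacturer, launch date) pair is unique in this data, both calls give the same row order. The difference only matters when the listed levels contain ties.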
Note: you could avoid the first swaplevel by reordering the keys of the initial groupby:
g = df.groupby(['Manufacturer', 'Product Launch Date', 'Product Name']).sum()
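A minimal sketch of this variant, assuming a recent pandas release: groupby() sorts the group keys by default (sort=True), so putting the launch date ahead of the product name already yields the desired row order, and a single swaplevel restores the original level order. The 'Date' column is omitted here as a simplification, and `g2` and `result` are names chosen for the example.

```python
import pandas as pd

# Sample data from the question, minus the 'Date' column
# (an illustrative simplification so that only 'Sales' is summed).
df = pd.DataFrame({
    'Manufacturer': ['Apple', 'Apple', 'Samsung', 'Samsung'] * 2,
    'Product Name': ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab'] * 2,
    'Product Launch Date': ['2001-10-23', '2010-04-03',
                            '2009-04-27', '2010-09-02'] * 2,
    'Sales': [12, 13, 14, 15, 22, 17, 10, 7],
})

# Group with 'Product Launch Date' ahead of 'Product Name'; the default
# sort=True orders rows by manufacturer, then by launch date.
g2 = df.groupby(['Manufacturer', 'Product Launch Date', 'Product Name']).sum()

# One swap puts 'Product Name' back in the second position without
# changing the row order.
result = g2.swaplevel(1, 2)
print(result)
```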