In Pandas, how to sort one level of a multi-index based on the values of a column, while maintaining the grouping of the other level
Question
I'm taking a Data Mining course at university right now, but I'm a wee bit stuck on a multi-index sorting problem.
The actual data involves about 1 million movie reviews, and I'm trying to analyze it by American zip code. To test out how to do what I want, though, I've been using a much smaller data set of 250 randomly generated ratings for 10 movies, with age groups instead of zip codes.
This is what I have right now: a multi-indexed DataFrame in Pandas with two levels, 'group' and 'title'.
                  rating
group   title
        Alien     4.000000
        Argo      2.166667
Adults  Ben-Hur   3.666667
        Gandhi    3.200000
        ...       ...
        Alien     3.000000
        Argo      3.750000
Coeds   Ben-Hur   3.000000
        Gandhi    2.833333
        ...       ...
        Alien     2.500000
        Argo      2.750000
Kids    Ben-Hur   3.000000
        Gandhi    3.200000
        ...       ...
What I'm aiming for is to sort the titles based on their rating within the group (and only show the most popular 5 or so titles within each group)
So something like this (but I'm only going to show two titles in each group):
                  rating
group   title
        Alien     4.000000
Adults  Ben-Hur   3.666667
        Argo      3.750000
Coeds   Alien     3.000000
        Gandhi    3.200000
Kids    Ben-Hur   3.000000
Anyone know how to do this? I've tried sort_order, sort_index, etc and swapping the levels, but they mix up the groups too. So it then looks like:
                  rating
group   title
Adults  Alien     4.000000
Coeds   Argo      3.750000
Adults  Ben-Hur   3.666667
Kids    Gandhi    3.666667
Coeds   Alien     3.000000
Kids    Ben-Hur   3.000000
I'm kind of looking for something like this: Multi-Index Sorting in Pandas, but instead of sorting based on another level, I want to sort based on the values. Kind of like if that person wanted to sort based on his sales column.
Thanks!
Recommended answer
You're looking for sort. Note: this works in place (i.e. modifies s); to return a copy, use order:
In [14]: s.order()
Out[14]:
1  3    1
2  1    2
1  1    3
dtype: int64
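Series.sort and Series.order belong to older pandas releases; in current versions the same operation is spelled sort_values, which returns a sorted copy. A minimal sketch follows. The definition of s is not shown here, so the values and index below are an assumption, picked only so that the result matches the output above:

```python
import pandas as pd

# Assumed definition of s (not given in the text): a Series with a
# two-level index whose sorted form matches the output shown above.
s = pd.Series([3, 1, 2], index=[[1, 1, 2], [1, 3, 1]])

# sort_values returns a new, sorted Series and leaves s unchanged.
sorted_s = s.sort_values()
print(sorted_s)
```

Note that this sorts purely by value, so rows from different first-level groups end up interleaved, which is exactly the mixing the question describes.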
Update: I realised what you were actually asking, and I think this ought to be an option in sortlevel, but for now I think you have to reset_index, groupby and apply:
In [21]: s.reset_index(name='s').groupby('level_0').apply(lambda s: s.sort('s')).set_index(['level_0', 'level_1'])['s']
Out[21]:
level_0  level_1
1        3          1
         1          3
2        1          2
Name: 0, dtype: int64
Note: afterwards you can set the level names to [None, None].
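On current pandas, where DataFrame.sort and Series.order are no longer available, a group-preserving sort like the one asked for can be sketched with sort_values followed by groupby(...).head(n). This is a minimal sketch rather than the method used above: the ratings frame is hypothetical sample data modelled on the question's table, top_n is an illustrative name, and title is added as a tie-breaker (an assumption) so the result is deterministic.

```python
import pandas as pd

# Hypothetical sample data modelled on the question's table.
groups = ["Adults", "Coeds", "Kids"]
titles = ["Alien", "Argo", "Ben-Hur", "Gandhi"]
index = pd.MultiIndex.from_product([groups, titles], names=["group", "title"])
ratings = pd.DataFrame(
    {"rating": [4.0, 2.166667, 3.666667, 3.2,
                3.0, 3.75, 3.0, 2.833333,
                2.5, 2.75, 3.0, 3.2]},
    index=index,
)

top_n = 2  # illustrative; the question mentions "5 or so"

# Sort by group first so groups stay together, then by rating (descending)
# within each group; title breaks ties between equal ratings.
ordered = ratings.sort_values(
    ["group", "rating", "title"], ascending=[True, False, True]
)

# head(n) on a groupby keeps the first n rows of each group, in sorted order.
top = ordered.groupby(level="group", sort=False).head(top_n)
print(top)
```

Sorting on the group level first is what keeps the groups from being mixed together; the rating only decides the order inside each group.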