Ranks within groupby in pandas


Problem description


I have typical "panel data" (in the econometric sense, not the pandas Panel object). The dataframe has a Date column, an ID column, and other columns that contain certain values. For each Date, I need to rank cross-sectionally across IDs based on V1 into 10 groups (deciles) and create a new column called rank_col (taking values 1 to 10) to identify the rank. Then I want to pool all the rank1, rank2, ..., rank10 rows across time to get statistics like mean and std.

This can be accomplished easily in SAS with the following code, which also illustrates my purpose:

proc sort data=df;
    by Date;
proc rank data=df out=df_ranked groups=10;
    var V1;
    ranks rank_col;
    by Date;
run;

df_ranked is identical to df except that it has one more column, called rank_col, which contains the rank group that each row belongs to.

Sorry, I don't have sample data to show the structure; a really long dataset would be needed to illustrate it. But the SAS code shows exactly what I am after.

Thanks for your help!

Solution

A way I just figured out:

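import pandas as pd

def grouping(data):
    # Work on a copy so the group handed in by groupby is not mutated.
    data = data.copy()
    # Decile label (0-9) of V1 within this Date group.
    data['ranks'] = pd.qcut(data['V1'], 10, labels=False)
    return data

# group_keys=False keeps the original row index on recent pandas versions.
df_ranked = df.groupby('Date', group_keys=False).apply(grouping)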

This assumes dec preserves the right position for each row, which holds because pd.qcut applied to a Series returns a Series with the same index.
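The alignment assumption can be checked directly. Below is a minimal sketch using a made-up single-date sample with a deliberately non-default index (the index labels are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical single-date group with an arbitrary, unsorted index.
data = pd.DataFrame(
    {"V1": [10, 8, 6, 11, 13, 4, 2]},
    index=[42, 7, 19, 3, 88, 21, 5],
)

dec = pd.qcut(data["V1"], 10, labels=False)

# qcut on a Series returns a Series carrying the same index labels,
# so assigning it back aligns row by row.
same_index = dec.index.equals(data.index)
data["ranks"] = dec
ranks_by_label = data["ranks"].to_dict()
```

If same_index is True, each decile label lands on the row it was computed from, whatever the row order.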

Please post if you have a better way, or point out any mistakes in this method.

Thanks!

Edit: You can just return a single new ranks column if you do something like the following:

>>> df


         Date  id  V1
0  2013-01-01   1  10
1  2013-01-01   2   8
2  2013-01-01   3   6
3  2013-01-01   4  11
4  2013-01-01   5  13
5  2013-01-01   6   4
6  2013-01-01   7   2
7  2013-02-01   1   1
8  2013-02-01   2   3
9  2013-02-01   3   9
10 2013-02-01   4  11
11 2013-02-01   5   7
12 2013-02-01   6   4
13 2013-02-01   7   6
14 2013-02-01   8  14

>>> foo = lambda x: pd.Series(pd.qcut(x, 10, labels=False), index=x.index)
>>> df['ranks'] = df.groupby('Date', group_keys=False)['V1'].apply(foo)
>>> df

         Date  id  V1  ranks
0  2013-01-01   1  10      6
1  2013-01-01   2   8      4
2  2013-01-01   3   6      3
3  2013-01-01   4  11      8
4  2013-01-01   5  13      9
5  2013-01-01   6   4      1
6  2013-01-01   7   2      0
7  2013-02-01   1   1      0
8  2013-02-01   2   3      1
9  2013-02-01   3   9      7
10 2013-02-01   4  11      8
11 2013-02-01   5   7      5
12 2013-02-01   6   4      2
13 2013-02-01   7   6      4
14 2013-02-01   8  14      9
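For the 1 to 10 labels and the pooled statistics the question asks about, one possible variant is sketched below. It swaps apply for groupby transform, which is not what the answer above uses, and the column name rank_col is taken from the question. The sample data is the same frame shown above.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2013-01-01"] * 7 + ["2013-02-01"] * 8,
    "id": list(range(1, 8)) + list(range(1, 9)),
    "V1": [10, 8, 6, 11, 13, 4, 2, 1, 3, 9, 11, 7, 4, 6, 14],
})

# transform returns a result indexed like df, so it can be assigned directly.
# Adding 1 shifts qcut's 0-9 labels to the 1-10 range the question asks for.
df["rank_col"] = (
    df.groupby("Date")["V1"]
      .transform(lambda x: pd.qcut(x, 10, labels=False))
    + 1
)

# Pool each decile across all dates and summarise V1.
pooled = df.groupby("rank_col")["V1"].agg(["mean", "std", "count"])
```

With so few rows per date, some deciles are empty or hold a single row, so std is NaN for those groups. On real data with many tied V1 values, pd.qcut may raise an error about non-unique bin edges, and the tie handling would need to be decided separately.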
