Ranks within groupby in pandas
Problem description
I have typical panel data (in the econometric sense, not the pandas Panel object). The DataFrame has a Date column, an ID column, and other columns that contain certain values. For each Date, I need to rank the IDs cross-sectionally on V1 into 10 groups (deciles) and create a new column called rank_col (taking values 1 to 10) that identifies the group. Then I want to pool rank1, rank2, ..., rank10 across time to get statistics such as the mean and standard deviation.
This can be accomplished easily in SAS with the following code, which also illustrates my purpose:
proc sort data=df;
    by Date;
proc rank data=df out=df_ranked groups=10;
    var V1;
    ranks rank_col;
    by Date;
run;
df_ranked is identical to df except that it has one more column, rank_col, which contains the rank group that each row belongs to.
Sorry, I don't have sample data to show the structure; a long, real dataset would be needed to illustrate it. But the SAS code shows exactly what I am after.
Thanks for your help!
A way I just figured out:
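import pandas as pd

def grouping(data):
    # Assign each row of this Date group to a decile of V1 (integer codes 0 to 9)
    dec = pd.qcut(data['V1'], 10, labels=False)
    data['ranks'] = dec
    return data

df_ranked = df.groupby('Date').apply(grouping)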
This assumes that dec preserves the right position for each row.
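One way to check that assumption is sketched below. The toy values and the deliberately shuffled index are hypothetical, chosen only for illustration: when pd.qcut receives a Series it returns a Series with the same index labels, so assigning the result back as a column should line up row by row.

```python
import pandas as pd

# Hypothetical toy data for a single Date group, with a non-default index
# so that any misalignment would be visible.
data = pd.DataFrame(
    {"V1": [10, 8, 6, 11, 13, 4, 2]},
    index=[30, 10, 50, 20, 60, 40, 0],
)

dec = pd.qcut(data["V1"], 10, labels=False)

# The result carries the same index labels as the input Series,
# so column assignment aligns on those labels.
same_index = dec.index.equals(data.index)
data["ranks"] = dec

# Sanity check: a larger V1 never receives a smaller bin code.
ordered = data.sort_values("V1")["ranks"].is_monotonic_increasing
print(same_index, ordered)
```

If both values print as True, the positions were preserved for this sample; it is a spot check, not a proof for every dataset.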
Please post if you have a better way, or point out any mistakes in this method.
Thanks!
Edit: You can return just a single new ranks column if you do something like the following:
>>> df
          Date  id  V1
0   2013-01-01   1  10
1   2013-01-01   2   8
2   2013-01-01   3   6
3   2013-01-01   4  11
4   2013-01-01   5  13
5   2013-01-01   6   4
6   2013-01-01   7   2
7   2013-02-01   1   1
8   2013-02-01   2   3
9   2013-02-01   3   9
10  2013-02-01   4  11
11  2013-02-01   5   7
12  2013-02-01   6   4
13  2013-02-01   7   6
14  2013-02-01   8  14
>>> foo = lambda x: pd.Series(pd.qcut(x,10,labels=False),index=x.index)
>>> df['ranks'] = df.groupby('Date')['V1'].apply(foo)
>>> df
          Date  id  V1  ranks
0   2013-01-01   1  10      6
1   2013-01-01   2   8      4
2   2013-01-01   3   6      3
3   2013-01-01   4  11      8
4   2013-01-01   5  13      9
5   2013-01-01   6   4      1
6   2013-01-01   7   2      0
7   2013-02-01   1   1      0
8   2013-02-01   2   3      1
9   2013-02-01   3   9      7
10  2013-02-01   4  11      8
11  2013-02-01   5   7      5
12  2013-02-01   6   4      2
13  2013-02-01   7   6      4
14  2013-02-01   8  14      9
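The question also asks for ranks from 1 to 10 and for pooled statistics per rank group, which the session above does not show. A minimal sketch of both steps follows, assuming the same sample data; it swaps apply for groupby(...).transform, which is not what the answer used but returns a result aligned to the original index. The column name rank_col comes from the question, and the choice of mean, std and count is an assumption about which statistics are wanted.

```python
import pandas as pd

# Sample data copied from the session above.
df = pd.DataFrame({
    "Date": ["2013-01-01"] * 7 + ["2013-02-01"] * 8,
    "id": list(range(1, 8)) + list(range(1, 9)),
    "V1": [10, 8, 6, 11, 13, 4, 2, 1, 3, 9, 11, 7, 4, 6, 14],
})

# Decile code within each Date; qcut codes start at 0, so add 1 to get 1..10.
df["rank_col"] = (
    df.groupby("Date")["V1"]
      .transform(lambda x: pd.qcut(x, 10, labels=False))
    + 1
)

# Pool each rank group across all dates and summarise V1.
stats = df.groupby("rank_col")["V1"].agg(["mean", "std", "count"])
print(stats)
```

With only seven or eight rows per date, some rank groups end up with a single observation, so their std is NaN. Note also that pd.qcut raises a ValueError when tied values make the bin edges non-unique; its duplicates="drop" option is one way around that, at the cost of fewer than 10 groups.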