Sorting columns and selecting top n rows in each group pandas dataframe
Question
I have a dataframe like this:
mainid pidx pidy score
1 a b 2
1 a c 5
1 c a 7
1 c b 2
1 a e 8
2 x y 1
2 y z 3
2 z y 5
2 x w 12
2 x v 1
2 y x 6
I want to group by the column 'pidx', then sort score in descending order within each group (i.e. for each pidx), and then select head(2), i.e. the top 2 rows from each group.
The result I am looking for is like this:
mainid pidx pidy score
1 a e 8
1 a c 5
1 c a 7
1 c b 2
2 x w 12
2 x y 1
2 y x 6
2 y z 3
2 z y 5
What I tried is:
df.sort(['pidx','score'],ascending = False).groupby('pidx').head(2)
This seems to work, but I don't know whether it is the right approach for a huge dataset. Is there a better method to get this result?
Recommended answer
There are 2 solutions:
1. sort_values and aggregate head:
df1 = df.sort_values('score',ascending = False).groupby('pidx').head(2)
print (df1)
mainid pidx pidy score
8 2 x w 12
4 1 a e 8
2 1 c a 7
10 2 y x 6
1 1 a c 5
7 2 z y 5
6 2 y z 3
3 1 c b 2
5 2 x y 1
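The output above is ordered by score across the whole frame, not grouped by pidx as in the requested result. A minimal sketch, assuming the sample data from the question is rebuilt by hand (the name top2 is only illustrative), sorts by both pidx and score before taking head(2) so that the rows of each group stay together:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'mainid': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'pidx': ['a', 'a', 'c', 'c', 'a', 'x', 'y', 'z', 'x', 'x', 'y'],
    'pidy': ['b', 'c', 'a', 'b', 'e', 'y', 'z', 'y', 'w', 'v', 'x'],
    'score': [2, 5, 7, 2, 8, 1, 3, 5, 12, 1, 6],
})

# Sort groups alphabetically and scores in descending order inside each group,
# then keep the first two rows of every group
top2 = (
    df.sort_values(['pidx', 'score'], ascending=[True, False])
      .groupby('pidx')
      .head(2)
)
print(top2)
```

groupby(...).head(2) keeps rows in the order of the sorted frame, so the result follows the layout of the desired output. Group x has two rows tied at score 1, so which of them is kept may depend on the sort algorithm.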
2. set_index and aggregate nlargest:
df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index()
print (df)
pidx mainid pidy score
0 a 1 e 8
1 a 1 c 5
2 c 1 a 7
3 c 1 b 2
4 x 2 w 12
5 x 2 y 1
6 y 2 x 6
7 y 2 z 3
8 z 2 y 5
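The set_index call above is there to carry mainid and pidy through the aggregation. A possible variant, sketched here under the assumption that the original row labels are unique (the names top_scores and rows are illustrative), uses the index returned by nlargest to look the full rows up in the original frame, which keeps the original column order and does not overwrite df:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'mainid': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'pidx': ['a', 'a', 'c', 'c', 'a', 'x', 'y', 'z', 'x', 'x', 'y'],
    'pidy': ['b', 'c', 'a', 'b', 'e', 'y', 'z', 'y', 'w', 'v', 'x'],
    'score': [2, 5, 7, 2, 8, 1, 3, 5, 12, 1, 6],
})

# nlargest on the grouped 'score' column returns a Series whose last
# index level holds the original row labels of the selected rows
top_scores = df.groupby('pidx')['score'].nlargest(2)

# Use those labels to pull the complete rows from the original frame
rows = df.loc[top_scores.index.get_level_values(-1)]
print(rows)
```

This still goes through the per-group nlargest call, so it should not be expected to run faster than solution 2.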
Timings:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'mainid': np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'score': np.random.randint(1000, size=N)})
#print (df)

# Loop-based alternative, timed below for comparison
def epat(df):
    grouped = df.groupby('pidx')
    new_df = pd.DataFrame([], columns=df.columns)
    for key, values in grouped:
        # sort descending so the slice keeps the two highest scores of the group
        top = grouped.get_group(key).sort_values('score', ascending=False)[:2]
        # axis must be passed by keyword in recent pandas versions
        new_df = pd.concat([new_df, top], axis=0)
    return new_df

print (epat(df))
In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2))
1 loop, best of 3: 309 ms per loop
In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index())
1 loop, best of 3: 7.11 s per loop
In [147]: %timeit (epat(df))
1 loop, best of 3: 22 s per loop
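The %timeit lines above are IPython magics. Outside IPython, a rough comparison could be made with the standard timeit module. The sketch below is only an illustration under assumptions: the frame is much smaller than the 1,000,000 rows used above (50,000 rows, 500 groups) so it finishes quickly, and the helper names top2_head and top2_nlargest are hypothetical. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
N = 50_000  # assumption: far smaller than the benchmark above
df = pd.DataFrame({
    'pidx': rng.integers(500, size=N),
    'score': rng.integers(1000, size=N),
})

def top2_head(frame):
    # solution 1: one global sort, then the first two rows per group
    return frame.sort_values('score', ascending=False).groupby('pidx').head(2)

def top2_nlargest(frame):
    # solution 2 (scores only): nlargest evaluated group by group
    return frame.groupby('pidx')['score'].nlargest(2)

# Best of three single runs for each approach
t_head = min(timeit.repeat(lambda: top2_head(df), number=1, repeat=3))
t_nlargest = min(timeit.repeat(lambda: top2_nlargest(df), number=1, repeat=3))
print(f"sort_values + head: {t_head:.4f} s")
print(f"nlargest:           {t_nlargest:.4f} s")
```

Both helpers select the same scores per group, so the comparison measures only how the selection is computed.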