pandas groupby: TOP 3 values for each group

Question

A newer, more general version of this question has been posted as "pandas groupby: TOP 3 values in each group and store in DataFrame", and a working solution has been posted there.

In this example I create a dataframe df with some random data spaced 5 minutes apart. I want to create a dataframe gdf (grouped df) that lists the 3 highest values for each hour.

That is, from this series of values:

                     VAL
TIME                    
2017-12-08 00:00:00   29
2017-12-08 00:05:00   56
2017-12-08 00:10:00   82
2017-12-08 00:15:00   13
2017-12-08 00:20:00   35
2017-12-08 00:25:00   53
2017-12-08 00:30:00   25
2017-12-08 00:35:00   23
2017-12-08 00:40:00   21
2017-12-08 00:45:00   12
2017-12-08 00:50:00   15
2017-12-08 00:55:00    9
2017-12-08 01:00:00   13
2017-12-08 01:05:00   87
2017-12-08 01:10:00    9
2017-12-08 01:15:00   63
2017-12-08 01:20:00   62
2017-12-08 01:25:00   52
2017-12-08 01:30:00   43
2017-12-08 01:35:00   77
2017-12-08 01:40:00   95
2017-12-08 01:45:00   79
2017-12-08 01:50:00   77
2017-12-08 01:55:00    5
2017-12-08 02:00:00   78
2017-12-08 02:05:00   41
2017-12-08 02:10:00   10
2017-12-08 02:15:00   10
2017-12-08 02:20:00   88

I am very close to the solution, but I cannot find the correct syntax for the last step. What I have so far (largest3) is:

                                           VAL
TIME                  TIME                    
2017-12-08 00:00:00   2017-12-08 00:10:00   82
                      2017-12-08 00:05:00   56
                      2017-12-08 00:25:00   53
2017-12-08 01:00:00   2017-12-08 01:40:00   95
                      2017-12-08 01:05:00   87
                      2017-12-08 01:45:00   79
2017-12-08 02:00:00   2017-12-08 02:20:00   88
                      2017-12-08 02:00:00   78
                      2017-12-08 02:05:00   41

From this I would like to obtain the following gdf (the time at which each maximum was reached is not important):

                    VAL1  VAL2  VAL3
TIME
2017-12-08 00:00:00   82    56    53
2017-12-08 01:00:00   95    87    79
2017-12-08 02:00:00   88    78    41

Here is the code:

import pandas as pd
from datetime import datetime, timedelta
import numpy as np

# test data: random integers at 5-minute intervals (0.1 day = 2 h 24 min)
date_ref = datetime(2017,12,8,0,0,0)
days = pd.date_range(date_ref, date_ref + timedelta(0.1), freq='5min')
np.random.seed(seed=1111)
data1 = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'TIME': days, 'VAL': data1})
df = df.set_index('TIME')
print(df)
print("----")

# groupby: one group per hour ('1h'; the older alias '1H' is deprecated in recent pandas)
group1 = df.groupby(pd.Grouper(freq='1h'))
# the 3 largest values per hour, indexed by (hour, original timestamp)
largest3 = pd.DataFrame(group1['VAL'].nlargest(3))
print(largest3)

gdf = pd.DataFrame()
# ???? <------------------- the missing step: one row per hour, one column per value

Thanks.

Recommended answer

Note: this solution works only if every group has at least 3 rows.

Try this:

In [59]: x = (df.groupby(pd.Grouper(freq='h'))['VAL']
                .apply(lambda x: x.nlargest(3))
                .reset_index(level=1, drop=True)
                .to_frame('VAL'))

In [60]: x
Out[60]:
                     VAL
TIME
2017-12-08 00:00:00   82
2017-12-08 00:00:00   56
2017-12-08 00:00:00   53
2017-12-08 01:00:00   95
2017-12-08 01:00:00   87
2017-12-08 01:00:00   79
2017-12-08 02:00:00   88
2017-12-08 02:00:00   78
2017-12-08 02:00:00   41

In [61]: x.set_index(np.arange(len(x)) % 3, append=True)['VAL'].unstack().add_prefix('VAL')
Out[61]:
                     VAL0  VAL1  VAL2
TIME
2017-12-08 00:00:00    82    56    53
2017-12-08 01:00:00    95    87    79
2017-12-08 02:00:00    88    78    41
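
For running outside an IPython session, the two steps above can be put together as a plain script. This is only a sketch: it hard-codes the values from the table in the question instead of regenerating them randomly (an assumption made so that the result can be checked by eye), and it uses the lowercase 'h' frequency alias.

```python
import numpy as np
import pandas as pd

# Values copied from the table in the question: 5-minute spacing
# starting at 2017-12-08 00:00 (hard-coded here instead of random data).
vals = [29, 56, 82, 13, 35, 53, 25, 23, 21, 12, 15, 9,
        13, 87, 9, 63, 62, 52, 43, 77, 95, 79, 77, 5,
        78, 41, 10, 10, 88]
idx = pd.date_range("2017-12-08", periods=len(vals), freq="5min", name="TIME")
df = pd.DataFrame({"VAL": vals}, index=idx)

# Step 1: keep the three largest values of each hour, then drop the inner
# index level (the original timestamps), which is not needed.
x = (df.groupby(pd.Grouper(freq="h"))["VAL"]
       .apply(lambda s: s.nlargest(3))
       .reset_index(level=1, drop=True)
       .to_frame("VAL"))

# Step 2: label the rows 0, 1, 2 within each hour and pivot those labels
# into columns. This relies on every hour having exactly three rows in x.
gdf = (x.set_index(np.arange(len(x)) % 3, append=True)["VAL"]
         .unstack()
         .add_prefix("VAL"))

print(gdf)
```

The columns come out as VAL0, VAL1, VAL2; using `np.arange(len(x)) % 3 + 1` as the labels instead would give the VAL1 to VAL3 names asked for in the question.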

Some explanation:

In [94]: x.set_index(np.arange(len(x)) % 3, append=True)
Out[94]:
                       VAL
TIME
2017-12-08 00:00:00 0   82
                    1   56
                    2   53
2017-12-08 01:00:00 0   95
                    1   87
                    2   79
2017-12-08 02:00:00 0   88
                    1   78
                    2   41

In [95]: x.set_index(np.arange(len(x)) % 3, append=True)['VAL'].unstack()
Out[95]:
                      0   1   2
TIME
2017-12-08 00:00:00  82  56  53
2017-12-08 01:00:00  95  87  79
2017-12-08 02:00:00  88  78  41
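
The `np.arange(len(x)) % 3` labels assume that every group contributes exactly three rows; a shorter group shifts the labels of every group that follows it. One possible way around this, sketched below and not part of the original answer, is to number the rows inside each hour with `cumcount()`. The sample data is hypothetical and chosen so that the first hour has only two rows.

```python
import pandas as pd

# Hypothetical data: the first hour has only two rows, the second has four.
idx = pd.to_datetime(["2017-12-08 00:00", "2017-12-08 00:05",
                      "2017-12-08 01:00", "2017-12-08 01:05",
                      "2017-12-08 01:10", "2017-12-08 01:15"]).rename("TIME")
df = pd.DataFrame({"VAL": [13, 87, 29, 56, 82, 13]}, index=idx)

# Up to three largest values per hour, indexed by the hour only.
top = (df.groupby(pd.Grouper(freq="h"))["VAL"]
         .apply(lambda s: s.nlargest(3))
         .reset_index(level=1, drop=True))

# Position of each row within its own hour (1 = largest), computed per
# group, so a short group does not disturb the numbering of later groups.
rank = top.groupby(level=0).cumcount().to_numpy() + 1

# Pivot the per-hour rank into columns; missing ranks become NaN.
gdf = (top.to_frame("VAL")
          .set_index(rank, append=True)["VAL"]
          .unstack()
          .add_prefix("VAL"))

print(gdf)
```

Hours with fewer than three rows get NaN in the unused columns, which also turns those columns into floats.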
