pandas groupby: TOP 3 values in each group and store in DataFrame


Question

This is a follow-up to "pandas groupby: TOP 3 values for each group". The solution described there works well when every group has at least 3 rows, but it fails if at least one of the groups is smaller than that.

Here I propose a new data set that requires another solution.

Some data are saved at random times, and I need to find the 3 highest values for each hour:

                     VAL
TIME                    
2017-12-08 00:55:00   29
2017-12-08 01:10:00   56
2017-12-08 01:25:00   82
2017-12-08 01:40:00   13
2017-12-08 01:55:00   35
2017-12-08 02:10:00   53
2017-12-08 02:25:00   25
2017-12-08 02:40:00   23
2017-12-08 02:55:00   21
2017-12-08 03:10:00   12
2017-12-08 03:25:00   15

It should return this DataFrame, without the times at which the maxima were recorded:

                     VAL1  VAL2  VAL3
TIME 
2017-12-08 00:00:00   29   None  None
2017-12-08 01:00:00   82    56    35
2017-12-08 02:00:00   53    25    23
2017-12-08 03:00:00   15    12   None

None marks the missing values in groups with fewer than 3 rows.

The code to generate the data set is:

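from datetime import datetime, timedelta

import numpy as np
import pandas as pd

# 15-minute timestamps covering 0.11 days (about 2.6 hours), starting at 00:55
date_ref = datetime(2017, 12, 8, 0, 55, 0)
days = pd.date_range(date_ref, date_ref + timedelta(0.11), freq='15min')

# reproducible random values
np.random.seed(seed=1111)
data1 = np.random.randint(1, high=100, size=len(days))

df = pd.DataFrame({'TIME': days, 'VAL': data1})
df = df.set_index('TIME')

# group by hour and take the 3 largest values of each group
# ('1h' is the current alias; the original '1H' is deprecated since pandas 2.2)
group1 = df.groupby(pd.Grouper(freq='1h'))
largest3 = pd.DataFrame(group1["VAL"].nlargest(3))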

My question is how to save these values into a new DataFrame, perhaps by getting them from largest3:

                                         VAL
TIME                TIME                    
2017-12-08 00:00:00 2017-12-08 00:55:00   29
2017-12-08 01:00:00 2017-12-08 01:25:00   82
                    2017-12-08 01:10:00   56
                    2017-12-08 01:55:00   35
2017-12-08 02:00:00 2017-12-08 02:10:00   53
                    2017-12-08 02:25:00   25
                    2017-12-08 02:40:00   23
2017-12-08 03:00:00 2017-12-08 03:25:00   15
                    2017-12-08 03:10:00   12

Adding reset_index:

largest3 = pd.DataFrame(group1["VAL"].nlargest(3)).reset_index(level=1, drop=True)

This returns a better overview, but I don't know how to move on from here:

                     VAL
TIME                    
2017-12-08 00:00:00   29
2017-12-08 01:00:00   82
2017-12-08 01:00:00   56
2017-12-08 01:00:00   35
2017-12-08 02:00:00   53
2017-12-08 02:00:00   25
2017-12-08 02:00:00   23
2017-12-08 03:00:00   15
2017-12-08 03:00:00   12

Recommended answer

The trick is to create an index that is not based on set_index + modulus but on a counter that restarts in every group; cumcount provides exactly such a progressive counter inside a group:

largest3 = (pd.DataFrame(group1["VAL"]
    .nlargest(3))
    .reset_index(level=1, drop=True))

largest3['index'] = largest3.groupby('TIME').cumcount()  # temporary index

largest3 = (largest3.set_index("index", append=True)['VAL']
    .unstack()
    .add_prefix('VAL'))

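The result is as requested, apart from the column labels, which start at 0, and the missing entries, which appear as NaN (so the values become floats):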

index                VAL0  VAL1  VAL2
TIME                                 
2017-12-08 00:00:00  29.0   NaN   NaN
2017-12-08 01:00:00  82.0  56.0  35.0
2017-12-08 02:00:00  53.0  25.0  23.0
2017-12-08 03:00:00  15.0  12.0   NaN
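
For readers who want the column labels VAL1..VAL3 and integer values as in the question, here is a minimal self-contained sketch of the same cumcount + unstack idea. It is a variation, not part of the original answer: it types in the values from the question instead of calling the random generator, and the names top3, rank and wide, the 1-based counter and the conversion to the nullable Int64 dtype are all assumptions made for illustration.

```python
import pandas as pd

# Values typed in from the question's table (not regenerated with the random seed).
times = pd.date_range("2017-12-08 00:55:00", periods=11, freq="15min", name="TIME")
values = [29, 56, 82, 13, 35, 53, 25, 23, 21, 12, 15]
df = pd.DataFrame({"VAL": values}, index=times)

# Top 3 values per hour; drop the inner index level holding the original timestamps.
top3 = (
    df.groupby(pd.Grouper(freq="h"))["VAL"]
    .nlargest(3)
    .reset_index(level=1, drop=True)
    .to_frame()
)

# Counter that restarts in every hour; +1 so the columns become VAL1..VAL3.
top3["rank"] = top3.groupby(level="TIME").cumcount().to_numpy() + 1

wide = (
    top3.set_index("rank", append=True)["VAL"]
    .unstack()             # one column per rank, one row per hour
    .add_prefix("VAL")
    .astype("Int64")       # nullable integers: gaps become <NA> instead of NaN
)
wide.columns.name = None   # drop the leftover "rank" label on the columns

print(wide)
```

With this data the 01:00 row holds 82, 56 and 35, while the 00:00 row holds 29 followed by two missing values. Keeping the float columns with NaN, as in the answer above, is just as valid; the Int64 conversion only changes how the gaps are represented.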
