Filter multi-indexed grouped pandas dataframe

Problem description

The data looks like this:

id  timestamp   date        value
1   2001-01-01  2001-05-01  0
1   2001-10-01  2001-05-01  1
2   2001-01-01  2001-05-01  0
2   2001-10-01  2001-05-01  0

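As you can see, the table contains the columns id, timestamp, date and value. Every row with the same id also has the same date. Furthermore, date always lies in time between the first and the last timestamp of each id.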
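The two properties stated above can be checked before filtering. The following is only a sketch: the sample frame mirrors the table in the question, and the names per_id, same_date_per_id and date_in_range are hypothetical helpers that do not appear in the original post.

```python
import pandas as pd

# Sample data mirroring the table above (dates written in ISO format)
df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'timestamp': pd.to_datetime(['2001-01-01', '2001-10-01',
                                 '2001-01-01', '2001-10-01']),
    'date': pd.to_datetime(['2001-05-01'] * 4),
    'value': [0, 1, 0, 0],
})

# One row per id: number of distinct dates, the date itself,
# and the first and last timestamp
per_id = df.groupby('id').agg(
    n_dates=('date', 'nunique'),
    date=('date', 'first'),
    first_ts=('timestamp', 'min'),
    last_ts=('timestamp', 'max'),
)

# Every id should have exactly one distinct date
same_date_per_id = per_id['n_dates'].eq(1).all()

# The date should lie between the first and last timestamp (inclusive)
date_in_range = per_id['date'].between(per_id['first_ts'],
                                       per_id['last_ts']).all()
```

If either flag is False, the filtering logic discussed in this post may not behave as described for the offending ids.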

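The task is to filter the table so that every id is removed that does not contain at least one entry with value > 0 at a timestamp after its individual date.

I implemented this by multi-indexing the table with level 0 = id and level 1 = timestamp and sorting it. Then I group it by level 0. Next I loop through each group (id) and assign a new value telling me whether the id is "good" (boolean). Finally I filter the table to the rows where good is True.

Unfortunately this implementation is very slow for a big (>10M rows) dataset. I am looking for a way to speed this up. My idea was to use groupby.apply(lambda g: something), but I did not get it to work and I do not know whether this is the fastest option.

Working code example: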

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'timestamp': ['01-01-2001', '01-10-2001', '01-01-2001', '01-10-2001'],
                   'date': ['01-05-2001', '01-05-2001', '01-05-2001', '01-05-2001'],
                   'value': [0, 1, 0, 0]})

df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['id', 'timestamp']).sort_index()
grouped = df.groupby(level=0)
df['good'] = False
# Loop over each id; flag it as good if any value > 0 occurs from its date onwards
for _, df_id in grouped:
    index = df_id.index
    df_id = df_id.droplevel(0)
    # Single .loc assignment avoids chained assignment; .iloc[0] takes the group's date
    df.loc[index, 'good'] = (df_id['value'].loc[df_id['date'].iloc[0]:] > 0).any()
df = df[df['good']]


Solution

To find all ids that have at least one row where value is greater than 0 and timestamp is later than date, create two masks with Series.gt, chain them with & for bitwise AND, and then test whether each group contains at least one True with GroupBy.any through GroupBy.transform:

df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id','timestamp'])

m = df['value'].gt(0) & df['timestamp'].gt(df['date'])
df = df[m.groupby(df['id']).transform('any')]
print (df)
   id  timestamp       date  value
0   1 2001-01-01 2001-01-05      0
1   1 2001-01-10 2001-01-05      1
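
As a possible variation on the answer above, the same row-level mask can be used to collect the matching ids once and then select rows with Series.isin, which skips the grouped transform. This is a sketch only: the names good_ids and result are hypothetical, and whether it is faster on a 10M-row frame has not been measured here.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'timestamp': ['01-01-2001', '01-10-2001', '01-01-2001', '01-10-2001'],
                   'date': ['01-05-2001', '01-05-2001', '01-05-2001', '01-05-2001'],
                   'value': [0, 1, 0, 0]})
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])

# Row-level mask: positive value at a timestamp strictly after the id's date
m = df['value'].gt(0) & df['timestamp'].gt(df['date'])

# ids that have at least one matching row
good_ids = df.loc[m, 'id'].unique()

# Keep every row that belongs to one of those ids
result = df[df['id'].isin(good_ids)]
```

On the sample data this keeps both rows of id 1 and drops id 2, the same rows that the transform('any') version selects.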
