Filter multi-indexed grouped pandas dataframe
Question
The data looks like this:
id timestamp date value
1 2001-01-01 2001-05-01 0
1 2001-10-01 2001-05-01 1
2 2001-01-01 2001-05-01 0
2 2001-10-01 2001-05-01 0
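As you can see, the table contains the columns id, timestamp, date and value. Every row with the same id also has the same date. Furthermore, date always lies somewhere between the first and the last timestamp of each id.
The task is to filter the table so that every id is removed which does not contain at least one entry with value > 0 at a timestamp after its individual date.
I implemented it by multi-indexing the table with level 0 = id and level 1 = timestamp and sorting it. Then I group it by level 0. Next I loop through every group (id) and assign a new value telling me whether the id is "good" (boolean). Finally I filter the table to the rows where good is True.
Unfortunately this implementation is very slow for a big (>10M rows) dataset.
I am looking for a way to speed this up. My idea was to use groupby.apply(lambda g: something), but I did not get it to work and I do not know if this is the fastest option possible.
Working code example: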
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'timestamp': ['01-01-2001', '01-10-2001', '01-01-2001', '01-10-2001'],
                   'date': ['01-05-2001', '01-05-2001', '01-05-2001', '01-05-2001'],
                   'value': [0, 1, 0, 0]})
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])
# Index by (id, timestamp) and sort so each id can be sliced by time
df = df.set_index(['id', 'timestamp']).sort_index()
grouped = df.groupby(level=0)
df['good'] = False
for id_, df_id in grouped:
    index = df_id.index
    # Drop the id level so the group is indexed by timestamp only
    df_id = df_id.droplevel(0)
    # True if any value > 0 occurs from this id's date onwards
    start = df_id['date'].iloc[0]
    df.loc[index, 'good'] = (df_id.loc[start:, 'value'] > 0).any()
df = df[df['good']]
To keep every id that has at least one row with value > 0 at a timestamp later than its date, create two masks with Series.gt, chain them with & for bitwise AND, and then test whether each group contains at least one True with GroupBy.any, broadcast back to the rows through GroupBy.transform:
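# Parse both columns as datetimes so they can be compared
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id','timestamp'])
# Row-level mask: positive value at a timestamp after the id's date
m = df['value'].gt(0) & df['timestamp'].gt(df['date'])
# Keep every row whose id has at least one True in the mask
df = df[m.groupby(df['id']).transform('any')]
print(df)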
   id  timestamp       date  value
0   1 2001-01-01 2001-01-05      0
1   1 2001-01-10 2001-01-05      1
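For comparison, here is a minimal self-contained sketch that runs the vectorized approach from the answer next to two variants on the sample data. The function names are hypothetical. The isin variant and the groupby.filter variant are assumptions added for illustration, not part of the original answer. The groupby.filter version is roughly what the groupby.apply idea from the question would look like; it still calls a Python function once per id, so it is unlikely to be the fastest option on a table with many ids.

```python
import pandas as pd

# Sample data from the question; the format is given explicitly (month first).
fmt = '%m-%d-%Y'
df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'timestamp': pd.to_datetime(
        ['01-01-2001', '01-10-2001', '01-01-2001', '01-10-2001'], format=fmt),
    'date': pd.to_datetime(['01-05-2001'] * 4, format=fmt),
    'value': [0, 1, 0, 0],
})


def keep_good_ids_transform(frame):
    """Approach from the answer: row mask, then per-id 'any' via transform."""
    hit = frame['value'].gt(0) & frame['timestamp'].gt(frame['date'])
    return frame[hit.groupby(frame['id']).transform('any')]


def keep_good_ids_isin(frame):
    """Variant (assumption): collect the qualifying ids, then select with isin."""
    hit = frame['value'].gt(0) & frame['timestamp'].gt(frame['date'])
    good_ids = frame.loc[hit, 'id'].unique()
    return frame[frame['id'].isin(good_ids)]


def keep_good_ids_filter(frame):
    """Variant (assumption): groupby.filter, one Python call per id."""
    return frame.groupby('id').filter(
        lambda g: bool((g['value'].gt(0) & g['timestamp'].gt(g['date'])).any())
    )


result_transform = keep_good_ids_transform(df)
result_isin = keep_good_ids_isin(df)
result_filter = keep_good_ids_filter(df)
print(result_transform)
```

On the sample data all three keep only the rows of id 1. Which one is faster on a 10M-row table depends on the number of ids and on the pandas version, so it is worth timing them on real data before choosing.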