General Groupby in Python Pandas: Fast Way
Ultimate Question
Is there a way to do a general, performant groupby-operation that does not rely on pd.groupby?
Input
pd.DataFrame([[1, '2020-02-01', 'a'], [1, '2020-02-10', 'b'], [1, '2020-02-17', 'c'], [2, '2020-02-02', 'd'], [2, '2020-03-06', 'b'], [2, '2020-04-17', 'c']], columns=['id', 'begin_date', 'status'])
   id  begin_date status
0   1  2020-02-01      a
1   1  2020-02-10      b
2   1  2020-02-17      c
3   2  2020-02-02      d
4   2  2020-03-06      b
5   2  2020-04-17      c
Desired Output
   id status  count  uniquecount
0   1      a      1            1
1   1      b      1            1
2   1      c      1            1
3   2      b      1            1
4   2      c      1            1
Problem
Now, there is an easy way to do that in Python, using Pandas.
df = df.groupby(["id", "status"]).agg(count=("begin_date", "count"), uniquecount=("begin_date", lambda x: x.nunique())).reset_index()
# As commented, omitting the lambda and replacing it with "begin_date", "nunique" will be faster. Thanks!
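For reference, the lambda-free version suggested in that comment would read:
# Faster: the built-in 'nunique' aggregation avoids a Python-level lambda
df = df.groupby(["id", "status"]).agg(count=("begin_date", "count"),
                                      uniquecount=("begin_date", "nunique")).reset_index()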
This operation is slow for larger datasets; I'd guess it is O(n²).
Existing solutions that lack the desired general applicability
Now, after some googling, there are some alternative solutions on StackOverflow, using numpy, iterrows, or other approaches.
Faster alternative to perform pandas groupby operation
Pandas fast weighted random choice from groupby
And an excellent one:
Groupby in python pandas: Fast Way
These solutions generally aim to create the "count" or "uniquecount" in my example, i.e. the aggregated value. But unfortunately they always perform only a single aggregation, and not over multiple groupby columns. They also never explain how to merge the result back into the grouped dataframe.
Is there a way to use itertools (like this answer: Faster alternative to perform pandas groupby operation, or even better this answer: Groupby in python pandas: Fast Way) that returns not only the series "count", but the whole dataframe in grouped form?
Ultimate Question
Is there a way to do a general, performant groupby-operation that does not rely on pd.groupby?
This would look something like this:
from typing import Dict, List

def fastGroupby(df, groupbyColumns: List[str], aggregateColumns: Dict):
    # numpy / iterrow magic
    return df_grouped

df = fastGroupby(df, ["id", "status"], {'count': ('begin_date', 'count'),
                                        'uniquecount': ('begin_date', 'nunique')})
And return the desired output.
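For illustration, here is a minimal sketch of what such a fastGroupby could look like for the count/nunique case, built on pd.factorize and np.unique instead of pd.groupby. This is hypothetical code, not taken from the linked answers, and it assumes no NaNs in the value column:

import numpy as np
import pandas as pd

def fast_groupby(df, key_cols, value_col):
    # Fold the key columns into one integer label per row
    # (assumes the product of the key cardinalities fits in int64)
    codes = np.zeros(len(df), dtype=np.int64)
    for col in key_cols:
        col_codes, uniques = pd.factorize(df[col])
        codes = codes * len(uniques) + col_codes

    # count: occurrences of each label; np.unique returns labels sorted
    _, first_idx, counts = np.unique(codes, return_index=True, return_counts=True)

    # uniquecount: count distinct (group, value) pairs per group
    # (assumes no NaN in value_col, which pd.factorize would encode as -1)
    val_codes = pd.factorize(df[value_col])[0]
    base = val_codes.max() + 1
    pair_groups = np.unique(codes * base + val_codes) // base
    nunique = np.unique(pair_groups, return_counts=True)[1]

    # one representative row per group, aligned with the sorted labels
    out = df.iloc[first_idx][list(key_cols)].reset_index(drop=True)
    out['count'] = counts
    out['uniquecount'] = nunique
    return out

df = fast_groupby(df, ["id", "status"], "begin_date")

Note that the group order here follows the first appearance of each key value rather than the sorted order pd.groupby produces.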
Before ditching groupby
I'd suggest first evaluating whether you are truly taking advantage of what groupby has to offer.
Do away with lambda in favor of built-in pd.DataFrameGroupBy methods
Many of the Series and DataFrame methods are implemented as pd.DataFrameGroupBy methods. You should use those directly as opposed to calling them with groupby + apply(lambda x: ...).
Further, for many calculations you can re-frame the problem as some vectorized operation on an entire DataFrame that then uses a groupby method implemented in cython. This will be fast.
A common example of this would be finding the proportion of 'Y' answers within a group. A straightforward approach would be to check the condition within each group and then get the proportion:
import numpy as np
import pandas as pd

N = 10**6
df = pd.DataFrame({'grp': np.random.choice(range(10000), N),
                   'answer': np.random.choice(['Y', 'N'], N)})

df.groupby('grp')['answer'].apply(lambda x: x.eq('Y').mean())
Thinking about the problem this way requires the lambda, because we do two operations within the groupby: check the condition, then average. This exact same calculation can instead be thought of as first checking the condition on the entire DataFrame and then calculating the average within each group:
df['answer'].eq('Y').groupby(df['grp']).mean()
This is a very minor change yet the consequences are huge, and the gains will become greater as the number of groups increases.
%timeit df.groupby('grp')['answer'].apply(lambda x: x.eq('Y').mean())
#2.32 s ± 99.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df['answer'].eq('Y').groupby(df['grp']).mean()
#82.8 ms ± 995 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Add sort=False as an argument
By default groupby sorts the output on the keys. If there is no reason to have a sorted output, you can get a slight gain by specifying sort=False.
Add observed=True as an argument
If grouping keys are categorical it will reindex to all possible combinations, even for groups that never appear in your DataFrame. If these are not important, removing them from the output will greatly improve the speed.
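This only matters when the grouping keys are categorical; a minimal sketch, assuming the key columns are first cast to category dtype:

# Hypothetical: with categorical keys, observed=True keeps only the group
# combinations that actually occur in the data instead of all possible ones
df["id"] = df["id"].astype("category")
df["status"] = df["status"].astype("category")
df.groupby(["id", "status"], observed=True).agg(count=("begin_date", "count")).reset_index()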
For your example we can examine the difference. There is an enormous gain from switching to pd.DataFrameGroupBy.nunique, and removing the sorting adds a little extra speed. The combination of both gives an "identical" solution (up to sorting), and is nearly 100x faster for many groups.
import perfplot
import pandas as pd
import numpy as np

def agg_lambda(df):
    return df.groupby(['id', 'status']).agg(uniquecount=('Col4', lambda x: x.nunique()))

def agg_nunique(df):
    return df.groupby(['id', 'status']).agg(uniquecount=('Col4', 'nunique'))

def agg_nunique_nosort(df):
    return df.groupby(['id', 'status'], sort=False).agg(uniquecount=('Col4', 'nunique'))

perfplot.show(
    setup=lambda N: pd.DataFrame({'Col1': range(N),
                                  'status': np.random.choice(np.arange(N), N),
                                  'id': np.random.choice(np.arange(N), N),
                                  'Col4': np.random.choice(np.arange(N), N)}),
    kernels=[
        lambda df: agg_lambda(df),
        lambda df: agg_nunique(df),
        lambda df: agg_nunique_nosort(df),
    ],
    labels=['Agg Lambda', 'Agg Nunique', 'Agg Nunique, No sort'],
    n_range=[2 ** k for k in range(20)],
    # Equality check on the same data, just allowing for a different sort order
    equality_check=lambda x, y: x.sort_index().compare(y.sort_index()).empty,
    xlabel="~ Number of Groups"
)