General Groupby in Python Pandas: Fast way


Problem Description


Ultimate Question

Is there a way to do a general, performant groupby-operation that does not rely on pd.groupby?

Input

pd.DataFrame([[1, '2020-02-01', 'a'], [1, '2020-02-10', 'b'], [1, '2020-02-17', 'c'], [2, '2020-02-02', 'd'], [2, '2020-03-06', 'b'], [2, '2020-04-17', 'c']], columns=['id', 'begin_date', 'status'])

   id  begin_date status
0   1  2020-02-01      a
1   1  2020-02-10      b
2   1  2020-02-17      c
3   2  2020-02-02      d
4   2  2020-03-06      b

Desired Output

   id status  count  uniquecount
0   1      a      1            1
1   1      b      1            1
2   1      c      1            1
3   2      b      1            1
4   2      c      1            1

Problem

Now, there is an easy way to do that in Python, using Pandas.

df = df.groupby(["id", "status"]).agg(count=("begin_date", "count"), uniquecount=("begin_date", lambda x: x.nunique())).reset_index()
# As commented, omitting the lambda and replacing it with "begin_date", "nunique" will be faster. Thanks!

This operation is slow for larger datasets; I'd take a guess and say it's O(n²).

Existent solutions that lack the desired general applicability

Now, after some googling, there are some alternative solutions on StackOverflow, using numpy, iterrows, or various other approaches.

Faster alternative to perform pandas groupby operation

Pandas fast weighted random choice from groupby

And an excellent one:

Groupby in python pandas: Fast Way

These solutions generally aim to create the "count" or "uniquecount" from my example, i.e. the aggregated value. But unfortunately they always compute only a single aggregation, and not over multiple groupby columns. Also, they never explain how to merge the result back into the grouped dataframe.

Is there a way to use itertools (like this answer: Faster alternative to perform pandas groupby operation, or even better this answer: Groupby in python pandas: Fast Way) that does not only return the series "count", but the whole dataframe in grouped form?

Ultimate Question

Is there a way to do a general, performant groupby-operation that does not rely on pd.groupby?

This would look something like this:

from typing import List
def fastGroupby(df, groupbyColumns: List[str], aggregateColumns):
    # numpy / iterrow magic
    return df_grouped

df = fastGroupby(df, ["id", "status"], {'count': ('begin_date', 'count'),
                                        'uniquecount': ('begin_date', 'nunique')})

And return the desired output.
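(For illustration only: the following sketch is not part of the original question, it uses a simplified signature, and every detail in it is an assumption. Such a function could be pieced together from pd.factorize and numpy, computing both aggregations without pd.groupby.)

import numpy as np
import pandas as pd

def fastGroupby(df, groupbyColumns, aggregateColumn):
    # Integer-encode every key column and the aggregated column.
    # (Assumes no missing values, which factorize would encode as -1.)
    codes = [pd.factorize(df[c])[0] for c in groupbyColumns]
    val_codes = pd.factorize(df[aggregateColumn])[0]

    # Collapse the per-column key codes into one integer label per row.
    combined = codes[0].astype(np.int64)
    for c in codes[1:]:
        combined = combined * (c.max() + 1) + c

    # Unique group labels, first row of each group, and a row -> group mapping.
    uniq, first_idx, inv = np.unique(combined, return_index=True, return_inverse=True)
    counts = np.bincount(inv)

    # Unique (group, value) pairs give the per-group number of distinct values.
    m = val_codes.max() + 1
    pair_groups = np.unique(combined * m + val_codes) // m
    uniquecounts = np.bincount(np.searchsorted(uniq, pair_groups), minlength=len(uniq))

    out = df[groupbyColumns].iloc[first_idx].reset_index(drop=True)
    out['count'] = counts
    out['uniquecount'] = uniquecounts
    return out

fastGroupby(df, ["id", "status"], "begin_date")

Whether something like this actually beats pd.groupby depends on the data; the answer below argues that tuning groupby itself is usually the better first step.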

Solution

Before ditching groupby I'd suggest first evaluating whether you are truly taking advantage of what groupby has to offer.

Do away with lambda in favor of built-in pd.DataFrameGroupBy methods.

Many of the Series and DataFrame methods are implemented as pd.DataFrameGroupBy methods. You should use those directly rather than calling them via groupby + apply(lambda x: ...).
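A minimal sketch on the question's own columns: both lines below compute the number of unique begin_date values per (id, status) group, but only the second dispatches to the optimized built-in method.

# Same result, very different speed once there are many groups:
df.groupby(["id", "status"])["begin_date"].apply(lambda x: x.nunique())  # Python-level lambda per group
df.groupby(["id", "status"])["begin_date"].nunique()                     # built-in groupby method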

Further, for many calculations you can re-frame the problem as a vectorized operation on the entire DataFrame followed by a groupby method implemented in Cython. This will be fast.

A common example of this would be finding the proportion of 'Y' answers within a group. A straightforward approach would be to check the condition within each group and then get the proportion:

import numpy as np
import pandas as pd

N = 10**6
df = pd.DataFrame({'grp': np.random.choice(range(10000), N),
                   'answer': np.random.choice(['Y', 'N'], N)})

df.groupby('grp')['answer'].apply(lambda x: x.eq('Y').mean())

Thinking about the problem this way requires the lambda, because we do two operations within the groupby: check the condition, then average. The exact same calculation can be thought of as first checking the condition on the entire DataFrame and then calculating the average within each group:

df['answer'].eq('Y').groupby(df['grp']).mean()

This is a very minor change yet the consequences are huge, and the gains will become greater as the number of groups increases.

%timeit df.groupby('grp')['answer'].apply(lambda x: x.eq('Y').mean())
#2.32 s ± 99.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df['answer'].eq('Y').groupby(df['grp']).mean()
#82.8 ms ± 995 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Add sort=False as an argument

By default groupby sorts the output on the group keys. If there is no reason to have sorted output, you can get a slight gain by specifying sort=False.
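For example (a sketch on the question's dataframe), the aggregated values are unchanged; only the row order of the result differs, following the first appearance of each key instead of sorted key order:

df.groupby(["id", "status"], sort=False)["begin_date"].nunique()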


Add observed=True as an argument

If the grouping keys are categorical, groupby will reindex the result to all possible category combinations, even for groups that never appear in your DataFrame. If those are not important, removing them from the output will greatly improve the speed.
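A small hypothetical example (the frame and column names here are made up for illustration): the category 'c' is declared but never occurs in the data, and observed=True drops it from the result.

cat_df = pd.DataFrame({
    'grp': pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c']),
    'val': [1, 2, 3],
})
cat_df.groupby('grp', observed=False)['val'].count()  # includes a 0-count row for 'c'
cat_df.groupby('grp', observed=True)['val'].count()   # only the observed groups 'a' and 'b'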


For your example we can examine the difference. There's an enormous gain from switching to pd.DataFrameGroupBy.nunique, and removing the sorting adds a little extra speed. The combination of both gives an "identical" solution (up to sorting), and is nearly 100x faster for many groups.

import perfplot
import pandas as pd
import numpy as np

def agg_lambda(df):
    return df.groupby(['id', 'status']).agg(uniquecount=('Col4', lambda x: x.nunique()))
    
def agg_nunique(df):
    return df.groupby(['id', 'status']).agg(uniquecount=('Col4', 'nunique'))

def agg_nunique_nosort(df):
    return df.groupby(['id', 'status'], sort=False).agg(uniquecount=('Col4', 'nunique'))

perfplot.show(
    setup=lambda N: pd.DataFrame({'Col1': range(N),
                       'status': np.random.choice(np.arange(N), N),
                       'id': np.random.choice(np.arange(N), N),
                       'Col4': np.random.choice(np.arange(N), N)}),
    kernels=[
        lambda df: agg_lambda(df),
        lambda df: agg_nunique(df),
        lambda df: agg_nunique_nosort(df),
    ],
    labels=['Agg Lambda', 'Agg Nunique', 'Agg Nunique, No sort'],
    n_range=[2 ** k for k in range(20)],
    # Equality check same data, just allow for different sorting
    equality_check=lambda x,y: x.sort_index().compare(y.sort_index()).empty,
    xlabel="~ Number of Groups"
)
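Applied back to the original problem, the combined recommendations would look roughly like this (a sketch; observed=True is left out because the question's keys are not categorical):

df.groupby(["id", "status"], sort=False).agg(
    count=("begin_date", "count"),
    uniquecount=("begin_date", "nunique"),
).reset_index()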
