How do you change input parameters of pandas groupby.agg function?

Question

I am having trouble using the groupby_object.agg() method with functions whose input parameters I want to change. Is there a reference listing the function names that .agg() accepts, and how do I pass parameters to them?

See the example below:

import pandas as pd
import numpy as np

df = pd.DataFrame({'numbers': [1, 2, 3, 2, 1, 3],
                   'colors': ['red', 'white', 'blue', 'red', 'white', np.nan],
                   'weight': [10, 10, 20, 5, 10, 20]})

df['colors'].nunique() # Returns 3 as NaN is not counted
df['colors'].nunique(dropna=False) # Returns 4 as NaN is counted

When I then group by 'numbers' and aggregate 'colors', how can I pass the dropna=False parameter to the function?

df.groupby('numbers').agg({'colors': 'nunique', 'weight': 'sum'})

Recommended answer

Though pandas has nice syntax for aggregating with dicts and NamedAggs, these can come at a huge efficiency cost. The reason is that, instead of using the built-in groupby methods, which are optimized and/or implemented in Cython, any .agg(lambda x: ...) or .apply(lambda x: ...) takes a much slower path.

This means you should stick with the built-ins that you can reference directly or by alias, and try a lambda only as a last resort:

In this particular case, use

df.groupby('numbers')[['colors']].agg('nunique', dropna=False)

Avoid

df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False)})
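As a small check on the question's own data, the sketch below compares the alias form, the direct method call, and the lambda form. The variable names (via_alias, via_method, via_lambda) are illustrative only and do not come from the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'numbers': [1, 2, 3, 2, 1, 3],
                   'colors': ['red', 'white', 'blue', 'red', 'white', np.nan],
                   'weight': [10, 10, 20, 5, 10, 20]})

# Keyword arguments given after the alias are forwarded to the built-in method.
via_alias = df.groupby('numbers')[['colors']].agg('nunique', dropna=False)

# Calling the groupby method directly is another way to reach the same built-in.
via_method = df.groupby('numbers')[['colors']].nunique(dropna=False)

# The lambda form returns the same counts but calls Python code once per group.
via_lambda = df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False)})

# Default behaviour for comparison: NaN is not counted.
default = df.groupby('numbers')[['colors']].nunique()

print(via_alias)
print(default)
```

Group 3 holds 'blue' and NaN, so it is the only group whose count changes when dropna=False is passed.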


This example shows that, although the two are equivalent in output and the change seems minor, the performance consequences are enormous, especially as the number of groups becomes large.

import perfplot
import pandas as pd
import numpy as np

def built_in(df):
    return df.groupby('numbers')[['colors']].agg('nunique', dropna=False)

def apply(df):
    return df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False)})

perfplot.show(
    # n rows spread over roughly n/10 groups; 'colors' draws n values from NaN plus 0..99
    setup=lambda n: pd.DataFrame({'numbers': np.random.randint(0, n//10+1, n),
                                  'colors': np.random.choice([np.nan] + [*range(100)], n)}),
    kernels=[
        lambda df: built_in(df),
        lambda df: apply(df)],
    
    labels=['Built-In', 'Apply'],
    n_range=[2 ** k for k in range(1, 20)],
    equality_check=np.allclose,  
    xlabel='~N Groups'
)
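If perfplot is not installed, a rough version of the same comparison can be sketched with the standard library's timeit. The data size n, the seed, and the name with_lambda are assumptions made for illustration, and the timings will vary by machine and pandas version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

n = 20_000  # hypothetical row count, giving roughly n/10 groups
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'numbers': rng.integers(0, n // 10 + 1, n),
    'colors': rng.choice([np.nan] + [*range(100)], n),
})

def built_in(df):
    # Built-in method referenced by alias, with dropna forwarded as a keyword.
    return df.groupby('numbers')[['colors']].agg('nunique', dropna=False)

def with_lambda(df):
    # Same result, but the lambda is called once per group.
    return df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False)})

# Confirm both approaches agree before timing them.
same = np.array_equal(built_in(df)['colors'].to_numpy(),
                      with_lambda(df)['colors'].to_numpy())

t_built_in = timeit.timeit(lambda: built_in(df), number=3)
t_lambda = timeit.timeit(lambda: with_lambda(df), number=3)

print(f"results equal: {same}")
print(f"built-in: {t_built_in:.4f}s, lambda: {t_lambda:.4f}s (3 runs each)")
```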

The .groupby() part of a groupby doesn't really do that much; it simply ensures the mapping is correct. So, although unintuitive, it is still much faster to aggregate each column separately with a built-in and concatenate the results at the end than to call agg once with a simpler dict that uses a lambda.

Here is an example that also sums the weight column. Splitting is still a lot faster, despite needing to join the results manually:

def built_in(df):
    return pd.concat([df.groupby('numbers')[['colors']].agg('nunique', dropna=False),
                      df.groupby('numbers')[['weight']].sum()], axis=1)

def apply(df):
    return df.groupby('numbers').agg({'colors': lambda x: x.nunique(dropna=False), 
                                      'weight': 'sum'})

perfplot.show(
    # n rows spread over roughly n/10 groups; 'colors' draws n values from NaN plus 0..99
    setup=lambda n: pd.DataFrame({'numbers': np.random.randint(0, n//10+1, n),
                                  'colors': np.random.choice([np.nan] + [*range(100)], n),
                                  'weight': np.random.normal(0,1,n)}),
    kernels=[
        lambda df: built_in(df),
        lambda df: apply(df)],
    
    labels=['Built-In', 'Apply'],
    n_range=[2 ** k for k in range(1, 20)],
    equality_check=np.allclose,  
    xlabel='~N Groups'
)
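Applied to the small DataFrame from the question, the split-and-concatenate approach might look like the sketch below. The intermediate names counts, totals, and combined are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'numbers': [1, 2, 3, 2, 1, 3],
                   'colors': ['red', 'white', 'blue', 'red', 'white', np.nan],
                   'weight': [10, 10, 20, 5, 10, 20]})

# Each column is aggregated with its own built-in groupby method.
counts = df.groupby('numbers')[['colors']].nunique(dropna=False)
totals = df.groupby('numbers')[['weight']].sum()

# Both results are indexed by 'numbers', so concat aligns them row by row.
combined = pd.concat([counts, totals], axis=1)
print(combined)
```

Because both pieces share the same group index, pd.concat with axis=1 needs no explicit join key.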
