将多个函数应用于多个groupby列 [英] Apply multiple functions to multiple groupby columns

查看:108
本文介绍了将多个函数应用于多个groupby列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

文档展示了如何应用多个在一个groupby对象上一次使用输出列名作为关键字的dict:

  In [563]:分组['D']。agg({'result1':np.sum,
.....:'result2':np.mean})
.....:
[563]:
result2 result1
A
bar -0.579846 -1.739537
foo -0.280588 -1.402938

但是,这仅适用于Series groupby对象。当一个字典同样通过DataFrame传递给一个组时,它希望这些键是该函数将应用到的列名。



我想要做的是对多个列应用多个函数(但某些列将多次运行)。此外,一些函数将依赖groupby对象中的其他列(如sumif函数)。我目前的解决方案是逐列,并执行类似上面的代码,使用lambdas函数依赖其他行。但是这需要很长时间,(我认为遍历groupby对象需要很长时间)。我必须改变它,以便在一次运行中迭代整个groupby对象,但是我想知道是否有内置的方式在熊猫中做到这一点干净。



例如,我试过类似于

  grouped.agg({'C_sum':lambda x: x ['C']。sum(),
'C_std':lambda x:x ['C']。std(),
'D_sum':lambda x:x ['D'] .sum()},
'D_sumifC3':lambda x:x ['D'] [x ['C'] == 3] .sum(),...)

但是如我所料,我得到一个KeyError(因为如果 agg 从DataFrame调用)。

是否有任何内置的方法来执行我想要执行的操作,或者可能会添加此功能,或者我只需要手动遍历groupby



谢谢

解决方案

对于第一部分,您可以传递列名称的字典对于键和值的函数列表:

 在[28]中:df 
Out [28]:
ABCDE GRP
0 0.395670 0.219560 0.600644 0.613445 0.242893 0
1 0.323911 0.464584 0.107215 0.204072 0.927325 0
2 0.321358 0.076037 0.166946 0.439661 0.914612 1
3 0.133466 0.447946 0.014815 0.130781 0.268290 1

在[26]中:f = {'A':['sum','mean'],'B':['prod']}

In [ 27]:df.groupby('GRP')。agg(f)
输出[27]:
AB
总和平均值产品
GRP
0 0.719580 0.359790 0.102004
1 0.454824 0.227412 0.034060

更新1:



因为集合函数在Series上起作用,所以对其他列名的引用将丢失。为了解决这个问题,您可以引用完整的数据框并使用lambda函数中的组索引对其进行索引。



这是一个很好的解决方法:
'b $ b

  In [67]:f = {'A':['sum','mean'],'B':['prod'],'' D':lambda g:df.loc [g.index] .E.sum()} 

在[69]中:df.groupby('GRP')。agg(f)
Out [69]:
ABD
sum mean prod< lambda>
GRP
0 0.719580 0.359790 0.102004 1.170219
1 0.454824 0.227412 0.034060 1.182901

在这里,结果的'D'列由总和'E'值组成。



更新2:

以下是我认为可以满足你所要求的所有方法。首先制作一个自定义lambda函数。下面,g引用该组。汇总时,g将是一个系列。将 g.index 传递给 df.ix [] 从df中选择当前组。然后我测试C列是否小于0.5。返回的布尔序列传递给 g [] ,它只选择符合条件的行。 In [95]:cust = lambda g:g [df.loc [g.index] ['C']< 0.5] .sum()

在[96]中:f = {'A':['sum','mean'],'B':['prod'],'D': {'my name':cust}}

在[97]中:df.groupby('GRP')。agg(f)
Out [97]:
ABD
总计表示我的名字
GRP
0 0.719580 0.359790 0.102004 0.204072
1 0.454824 0.227412 0.034060 0.570441


The docs show how to apply multiple functions on a groupby object at a time using a dict with the output column names as the keys:

In [563]: grouped['D'].agg({'result1' : np.sum,
   .....:                   'result2' : np.mean})
   .....:
Out[563]: 
      result2   result1
A                      
bar -0.579846 -1.739537
foo -0.280588 -1.402938

However, this only works on a Series groupby object. And when a dict is similarly passed to a groupby DataFrame, it expects the keys to be the column names that the function will be applied to.

What I want to do is apply multiple functions to several columns (but certain columns will be operated on multiple times). Also, some functions will depend on other columns in the groupby object (like sumif functions). My current solution is to go column by column, and doing something like the code above, using lambdas for functions that depend on other rows. But this is taking a long time, (I think it takes a long time to iterate through a groupby object). I'll have to change it so that I iterate through the whole groupby object in a single run, but I'm wondering if there's a built in way in pandas to do this somewhat cleanly.

For example, I've tried something like

grouped.agg({'C_sum' : lambda x: x['C'].sum(),
             'C_std': lambda x: x['C'].std(),
             'D_sum' : lambda x: x['D'].sum()},
             'D_sumifC3': lambda x: x['D'][x['C'] == 3].sum(), ...)

but as expected I get a KeyError (since the keys have to be a column if agg is called from a DataFrame).

Is there any built in way to do what I'd like to do, or a possibility that this functionality may be added, or will I just need to iterate through the groupby manually?

Thanks

解决方案

For the first part you can pass a dict of column names for keys and a list of functions for the values:

In [28]: df
Out[28]:
          A         B         C         D         E  GRP
0  0.395670  0.219560  0.600644  0.613445  0.242893    0
1  0.323911  0.464584  0.107215  0.204072  0.927325    0
2  0.321358  0.076037  0.166946  0.439661  0.914612    1
3  0.133466  0.447946  0.014815  0.130781  0.268290    1

In [26]: f = {'A':['sum','mean'], 'B':['prod']}

In [27]: df.groupby('GRP').agg(f)
Out[27]:
            A                   B
          sum      mean      prod
GRP
0    0.719580  0.359790  0.102004
1    0.454824  0.227412  0.034060

UPDATE 1:

Because the aggregate function works on Series, references to the other column names are lost. To get around this, you can reference the full dataframe and index it using the group indices within the lambda function.

Here's a hacky workaround:

In [67]: f = {'A':['sum','mean'], 'B':['prod'], 'D': lambda g: df.loc[g.index].E.sum()}

In [69]: df.groupby('GRP').agg(f)
Out[69]:
            A                   B         D
          sum      mean      prod  <lambda>
GRP
0    0.719580  0.359790  0.102004  1.170219
1    0.454824  0.227412  0.034060  1.182901

Here, the resultant 'D' column is made up of the summed 'E' values.

UPDATE 2:

Here's a method that I think will do everything you ask. First make a custom lambda function. Below, g references the group. When aggregating, g will be a Series. Passing g.index to df.ix[] selects the current group from df. I then test if column C is less than 0.5. The returned boolean series is passed to g[] which selects only those rows meeting the criteria.

In [95]: cust = lambda g: g[df.loc[g.index]['C'] < 0.5].sum()

In [96]: f = {'A':['sum','mean'], 'B':['prod'], 'D': {'my name': cust}}

In [97]: df.groupby('GRP').agg(f)
Out[97]:
            A                   B         D
          sum      mean      prod   my name
GRP
0    0.719580  0.359790  0.102004  0.204072
1    0.454824  0.227412  0.034060  0.570441

这篇关于将多个函数应用于多个groupby列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆