将多个函数应用于多个groupby列 [英] Apply multiple functions to multiple groupby columns
问题描述
文档展示了如何应用多个在一个groupby对象上一次使用输出列名作为关键字的dict:
In [563]:分组['D']。agg({'result1':np.sum,
.....:'result2':np.mean})
.....:
[563]:
result2 result1
A
bar -0.579846 -1.739537
foo -0.280588 -1.402938
但是,这仅适用于Series groupby对象。当一个字典同样通过DataFrame传递给一个组时,它希望这些键是该函数将应用到的列名。
我想要做的是对多个列应用多个函数(但某些列将多次运行)。此外,一些函数将依赖groupby对象中的其他列(如sumif函数)。我目前的解决方案是逐列,并执行类似上面的代码,使用lambdas函数依赖其他行。但是这需要很长时间,(我认为遍历groupby对象需要很长时间)。我必须改变它,以便在一次运行中迭代整个groupby对象,但是我想知道是否有内置的方式在熊猫中做到这一点干净。
例如,我试过类似于
grouped.agg({'C_sum':lambda x: x ['C']。sum(),
'C_std':lambda x:x ['C']。std(),
'D_sum':lambda x:x ['D'] .sum()},
'D_sumifC3':lambda x:x ['D'] [x ['C'] == 3] .sum(),...)
但是如我所料,我得到一个KeyError(因为如果 是否有任何内置的方法来执行我想要执行的操作,或者可能会添加此功能,或者我只需要手动遍历groupby 谢谢 对于第一部分,您可以传递列名称的字典对于键和值的函数列表: 更新1: 因为集合函数在Series上起作用,所以对其他列名的引用将丢失。为了解决这个问题,您可以引用完整的数据框并使用lambda函数中的组索引对其进行索引。 这是一个很好的解决方法: 在这里,结果的'D'列由总和'E'值组成。 更新2: 以下是我认为可以满足你所要求的所有方法。首先制作一个自定义lambda函数。下面,g引用该组。汇总时,g将是一个系列。将 The docs show how to apply multiple functions on a groupby object at a time using a dict with the output column names as the keys: However, this only works on a Series groupby object. And when a dict is similarly passed to a groupby DataFrame, it expects the keys to be the column names that the function will be applied to. What I want to do is apply multiple functions to several columns (but certain columns will be operated on multiple times). Also, some functions will depend on other columns in the groupby object (like sumif functions). My current solution is to go column by column, and doing something like the code above, using lambdas for functions that depend on other rows. But this is taking a long time, (I think it takes a long time to iterate through a groupby object). I'll have to change it so that I iterate through the whole groupby object in a single run, but I'm wondering if there's a built in way in pandas to do this somewhat cleanly. For example, I've tried something like but as expected I get a KeyError (since the keys have to be a column if Is there any built in way to do what I'd like to do, or a possibility that this functionality may be added, or will I just need to iterate through the groupby manually? Thanks For the first part you can pass a dict of column names for keys and a list of functions for the values: UPDATE 1: Because the aggregate function works on Series, references to the other column names are lost. To get around this, you can reference the full dataframe and index it using the group indices within the lambda function. Here's a hacky workaround: Here, the resultant 'D' column is made up of the summed 'E' values. UPDATE 2: Here's a method that I think will do everything you ask. First make a custom lambda function. Below, g references the group. When aggregating, g will be a Series. Passing
这篇关于将多个函数应用于多个groupby列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋! agg $ c,键必须是一个列$ c $>从DataFrame调用)。
在[28]中:df
Out [28]:
ABCDE GRP
0 0.395670 0.219560 0.600644 0.613445 0.242893 0
1 0.323911 0.464584 0.107215 0.204072 0.927325 0
2 0.321358 0.076037 0.166946 0.439661 0.914612 1
3 0.133466 0.447946 0.014815 0.130781 0.268290 1
在[26]中:f = {'A':['sum','mean'],'B':['prod']}
In [ 27]:df.groupby('GRP')。agg(f)
输出[27]:
AB
总和平均值产品
GRP
0 0.719580 0.359790 0.102004
1 0.454824 0.227412 0.034060
'b $ b
In [67]:f = {'A':['sum','mean'],'B':['prod'],'' D':lambda g:df.loc [g.index] .E.sum()}
在[69]中:df.groupby('GRP')。agg(f)
Out [69]:
ABD
sum mean prod< lambda>
GRP
0 0.719580 0.359790 0.102004 1.170219
1 0.454824 0.227412 0.034060 1.182901
g.index
传递给 df.ix []
从df中选择当前组。然后我测试C列是否小于0.5。返回的布尔序列传递给 g []
,它只选择符合条件的行。 In [95]:cust = lambda g:g [df.loc [g.index] ['C']< 0.5] .sum()
在[96]中:f = {'A':['sum','mean'],'B':['prod'],'D': {'my name':cust}}
在[97]中:df.groupby('GRP')。agg(f)
Out [97]:
ABD
总计表示我的名字
GRP
0 0.719580 0.359790 0.102004 0.204072
1 0.454824 0.227412 0.034060 0.570441
In [563]: grouped['D'].agg({'result1' : np.sum,
.....: 'result2' : np.mean})
.....:
Out[563]:
result2 result1
A
bar -0.579846 -1.739537
foo -0.280588 -1.402938
grouped.agg({'C_sum' : lambda x: x['C'].sum(),
'C_std': lambda x: x['C'].std(),
'D_sum' : lambda x: x['D'].sum()},
'D_sumifC3': lambda x: x['D'][x['C'] == 3].sum(), ...)
agg
is called from a DataFrame).In [28]: df
Out[28]:
A B C D E GRP
0 0.395670 0.219560 0.600644 0.613445 0.242893 0
1 0.323911 0.464584 0.107215 0.204072 0.927325 0
2 0.321358 0.076037 0.166946 0.439661 0.914612 1
3 0.133466 0.447946 0.014815 0.130781 0.268290 1
In [26]: f = {'A':['sum','mean'], 'B':['prod']}
In [27]: df.groupby('GRP').agg(f)
Out[27]:
A B
sum mean prod
GRP
0 0.719580 0.359790 0.102004
1 0.454824 0.227412 0.034060
In [67]: f = {'A':['sum','mean'], 'B':['prod'], 'D': lambda g: df.loc[g.index].E.sum()}
In [69]: df.groupby('GRP').agg(f)
Out[69]:
A B D
sum mean prod <lambda>
GRP
0 0.719580 0.359790 0.102004 1.170219
1 0.454824 0.227412 0.034060 1.182901
g.index
to df.ix[]
selects the current group from df. I then test if column C is less than 0.5. The returned boolean series is passed to g[]
which selects only those rows meeting the criteria.In [95]: cust = lambda g: g[df.loc[g.index]['C'] < 0.5].sum()
In [96]: f = {'A':['sum','mean'], 'B':['prod'], 'D': {'my name': cust}}
In [97]: df.groupby('GRP').agg(f)
Out[97]:
A B D
sum mean prod my name
GRP
0 0.719580 0.359790 0.102004 0.204072
1 0.454824 0.227412 0.034060 0.570441