Pandas: conditional group-specific computations
Question
Let's say I have a table with a key (e.g. customer ID) and two numeric columns, C1 and C2. I would like to group rows by the key (customer) and run some aggregators, such as sum and mean, on its columns. After computing the group aggregates, I would like to assign the results back to each customer row in the DataFrame (as customer-wide features added to each row).
I can see that I can do something like
df['F1'] = df.groupby(['Key'])['C1'].transform(np.sum)
if I want to aggregate just one column and add the result back to the DataFrame.
Can I make it conditional? That is, can I add up the C1 column within a group only for rows whose C2 column equals some number X, and still add the results back to the DataFrame?
How can I run an aggregator on a combination of columns, such as
np.sum(C1 + C2)?
What would be the simplest and most elegant way to implement this? What is the most efficient way? Can those aggregations be done in one pass?
Thanks.
Recommended answer
Here's some dummy data to set things up.
In [81]: df = pd.DataFrame({'Key': ['a','a','b','b','c','c'],
   ....:                    'C1': [1,2,3,4,5,6],
   ....:                    'C2': [7,8,9,10,11,12]})
In [82]: df['F1'] = df.groupby('Key')['C1'].transform(np.sum)
In [83]: df
Out[83]:
   C1  C2 Key  F1
0   1   7   a   3
1   2   8   a   3
2   3   9   b   7
3   4  10   b   7
4   5  11   c  11
5   6  12   c  11
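For readers running this outside IPython, the setup above can be written as a standalone script. This is only a minimal sketch, assuming a reasonably recent pandas version: it passes the string aliases 'sum' and 'mean' to transform instead of np.sum (recent pandas releases recommend the string form), and the column name F1_mean is made up here for illustration.

```python
import pandas as pd

# Same dummy data as in the answer.
df = pd.DataFrame({'Key': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'C1': [1, 2, 3, 4, 5, 6],
                   'C2': [7, 8, 9, 10, 11, 12]})

# Build the groupby once and reuse it for several aggregates.
gb = df.groupby('Key')

# transform returns one value per original row, so the result
# can be assigned straight back as a new column.
df['F1'] = gb['C1'].transform('sum')
df['F1_mean'] = gb['C1'].transform('mean')  # hypothetical column name

print(df)
```

Reusing the same groupby object for each aggregate avoids regrouping the data for every new feature column.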
If you want to do a conditional groupby, you can filter the DataFrame before calling .groupby on it. For example, to get the group sum of 'C1' only over rows where C2 is less than 8 or greater than 9:
In [87]: cond = (df['C2'] < 8) | (df['C2'] > 9)
In [88]: df['F2'] = df[cond].groupby('Key')['C1'].transform(np.sum)
In [89]: df
Out[89]:
   C1  C2 Key  F1   F2
0   1   7   a   3    1
1   2   8   a   3  NaN
2   3   9   b   7  NaN
3   4  10   b   7    4
4   5  11   c  11   11
5   6  12   c  11   11
This works because the transform operation preserves the index, so the result still aligns correctly with the original DataFrame. Rows excluded by the filter are left as NaN.
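Note that with the filtering approach, rows that fail the condition get NaN even when other rows in the same group matched. If the conditional total should instead appear on every row of the group, one option is to mask C1 with Series.where before grouping. The following is a sketch under the same dummy data, not the answer's own method; the column name F2_all is hypothetical, and a group with no matching rows would end up with 0 rather than NaN.

```python
import pandas as pd

df = pd.DataFrame({'Key': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'C1': [1, 2, 3, 4, 5, 6],
                   'C2': [7, 8, 9, 10, 11, 12]})

cond = (df['C2'] < 8) | (df['C2'] > 9)

# Filter-then-transform: only rows satisfying cond receive a value;
# the other rows become NaN when the result is aligned on the index.
df['F2'] = df[cond].groupby('Key')['C1'].transform('sum')

# Mask-then-transform: replace non-matching C1 values with NaN, then
# sum per Key. The sum skips NaN, and every row of the group
# receives the conditional total.
df['F2_all'] = df['C1'].where(cond).groupby(df['Key']).transform('sum')

print(df)
```

For the question's "C2 equals X" case, the condition would simply be `df['C2'] == X` for whatever value X is of interest.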
If you want to sum the group totals for two columns, probably the easiest way is something like this (someone may have something more clever):
In [93]: gb = df.groupby('Key')
In [94]: df['C1+C2'] = gb['C1'].transform(np.sum) + gb['C2'].transform(np.sum)
Here's another way to get group totals for multiple columns. The syntax isn't really any cleaner, but it may be more convenient for a large number of columns.
df['C1_C2'] = gb[['C1','C2']].apply(lambda x: pd.DataFrame(x.sum().sum(), index=x.index, columns=['']))
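Another possibility, closer to the np.sum(C1 + C2) wording in the question, is to add the columns row by row first and then take the group sum of that result. This is a sketch under the same dummy data, assuming a recent pandas version; the column names row_sum_total and multi_col_total are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Key': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'C1': [1, 2, 3, 4, 5, 6],
                   'C2': [7, 8, 9, 10, 11, 12]})

# Row-wise C1 + C2, then summed per Key and broadcast back to each row.
# Grouping a Series by df['Key'] works because both share the same index.
df['row_sum_total'] = (df['C1'] + df['C2']).groupby(df['Key']).transform('sum')

# Same idea for an arbitrary list of columns.
cols = ['C1', 'C2']
df['multi_col_total'] = df[cols].sum(axis=1).groupby(df['Key']).transform('sum')

print(df)
```

Both columns should match the C1+C2 column computed earlier, since summing across columns and then within groups gives the same total as adding up each column's group sum.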