Pandas: conditional group-specific computations

Question

Let's say I have a table with a key (e.g. customer ID) and two numeric columns C1 and C2. I would like to group rows by the key (customer) and run some aggregators like sum and mean on its columns. After computing group aggregators I would like to assign the results back to each customer row in a DataFrame (as some customer-wide features added to each row).

I can see that I can do something like
df['F1'] = df.groupby(['Key'])['C1'].transform(np.sum)
if I want to aggregate just one column and be able to add the result back to the DataFrame.

Can I make it conditional? That is, can I sum column C1 within a group only for rows whose C2 value equals some number X, and still add the result back to the DataFrame?

How can I run an aggregator on a combination of columns, such as:
np.sum(C1 + C2)?

What would be the simplest and most elegant way to implement it? What is the most efficient way to do it? Can those aggregations be done in one pass?

Thanks.

Recommended answer

Here's a setup with some dummy data.

In [81]: df = pd.DataFrame({'Key': ['a','a','b','b','c','c'], 
                            'C1': [1,2,3,4,5,6],  
                            'C2': [7,8,9,10,11,12]})
In [82]: df['F1'] = df.groupby('Key')['C1'].transform(np.sum)

In [83]: df
Out[83]: 
   C1  C2 Key  F1
0   1   7   a   3
1   2   8   a   3
2   3   9   b   7
3   4  10   b   7
4   5  11   c  11
5   6  12   c  11

If you want to do a conditional GroupBy, you can just filter the dataframe before calling .groupby on it. For example, suppose you want the group sum of 'C1' only over rows where C2 is less than 8 or greater than 9.

In [87]: cond = (df['C2'] < 8) | (df['C2'] > 9)

In [88]: df['F2'] = df[cond].groupby('Key')['C1'].transform(np.sum)

In [89]: df
Out[89]: 
   C1  C2 Key  F1  F2
0   1   7   a   3   1
1   2   8   a   3 NaN
2   3   9   b   7 NaN
3   4  10   b   7   4
4   5  11   c  11  11
5   6  12   c  11  11

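This works because the transform operation preserves the index of the filtered frame, so the result still aligns with the original dataframe on assignment. Rows excluded by the filter have no matching index label and are left as NaN.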
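
With the filtering approach, rows that fail the condition end up as NaN. If every row of a group should carry the group's conditional sum instead, one possible variant (a sketch, not part of the original answer) is to replace the non-matching C1 values with 0 and then transform over the full frame. The column name F2_all is hypothetical, and the string alias 'sum' is used in place of np.sum.

```python
import pandas as pd

df = pd.DataFrame({'Key': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'C1': [1, 2, 3, 4, 5, 6],
                   'C2': [7, 8, 9, 10, 11, 12]})

cond = (df['C2'] < 8) | (df['C2'] > 9)

# Keep C1 where the condition holds and use 0 elsewhere, then sum per group.
# Grouping the masked Series by the Key column keeps the original index,
# so the assignment lines up row by row.
df['F2_all'] = df['C1'].where(cond, 0).groupby(df['Key']).transform('sum')
print(df)
```

Replacing with 0 is only appropriate for sums; for an aggregator such as mean, masking with NaN (the default of where) would be the safer choice, since NaN values are skipped rather than counted.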

If you want to sum the group totals of two columns, the easiest approach is probably something like the following. Someone may have something more clever.

In [93]: gb = df.groupby('Key')

In [94]: df['C1+C2'] = gb['C1'].transform(np.sum) + gb['C2'].transform(np.sum)
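
The question's np.sum(C1 + C2) can also be read literally: add the two columns row by row, then sum within each group. The following is a sketch of that reading, which should give the same totals as adding the two group sums. The column name C1_plus_C2 is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Key': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'C1': [1, 2, 3, 4, 5, 6],
                   'C2': [7, 8, 9, 10, 11, 12]})

# Row-wise sum of the two columns, then a per-group total broadcast to each row.
row_sum = df['C1'] + df['C2']
df['C1_plus_C2'] = row_sum.groupby(df['Key']).transform('sum')
print(df)
```

The same pattern extends to other row-wise expressions (for example a product of columns), where splitting into separate per-column transforms would no longer be equivalent.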

Here's one other way to get group totals for multiple columns. The syntax isn't really any cleaner, but it may be more convenient for a large number of columns.

df['C1_C2'] = gb[['C1','C2']].apply(lambda x: pd.DataFrame(x.sum().sum(), index=x.index, columns=['']))
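
For many columns, a variant that avoids building a DataFrame inside apply might look like the following. It is a sketch that relies on transform returning a frame with the same index as the input, so the row-wise sum can be assigned directly. The list cols and the column name group_total are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Key': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'C1': [1, 2, 3, 4, 5, 6],
                   'C2': [7, 8, 9, 10, 11, 12]})

cols = ['C1', 'C2']  # hypothetical list; extend with any other numeric columns

# transform('sum') yields one per-group total per column, indexed like df;
# summing across columns (axis=1) gives the combined group total for each row.
df['group_total'] = df.groupby('Key')[cols].transform('sum').sum(axis=1)
print(df)
```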
