Combination of columns for aggregation after groupby

Problem Description

Looking for something like

df.groupby('key').aggregate(combination(columnA, columnB))

instead of

df['combination'] = combination(columnA, columnB)
df.groupby('key')['combination'].aggregate()

The only requirement is that the combination of columns is calculated after the groupby.

It seems natural, logically speaking, in some cases to first group and then aggregate.

One example would be different aggregate functions for different combinations of columns that use the same groups (a sketch of this follows the lists below).

Looking for

  1. groupby
  2. select the combination of columns
  3. use the corresponding aggregate function

instead of

  1. create all the necessary columns (one per aggregate function)
  2. groupby (once per aggregate function)
  3. apply the specific aggregate function
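
For reference, a minimal sketch of that second workflow, assuming pandas' named-aggregation syntax agg(new_name=(column, func)); the data mirrors the example below, and the "prod"/"sum" helper columns are hypothetical names:

import pandas as pd

df = pd.DataFrame({
    "key":     ["key1", "key2", "key1", "key3", "key2"],
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": [1, 2, 3, 4, 5],
})

# one materialized column per combination...
df["prod"] = df["ColumnA"] * df["ColumnB"]
df["sum"] = df["ColumnA"] + df["ColumnB"]

# ...then a single groupby can still serve both aggregate functions
print(df.groupby("key").agg(prod_mean=("prod", "mean"),
                            sum_max=("sum", "max")))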


Example

key     ColumnA  ColumnB
key1       1        1
key2       2        2
key1       3        3
key3       4        4
key2       5        5

#can do
df['combination'] = df.ColumnA * df.ColumnB
df.groupby('key').mean()

#looking for (wished-for syntax, not valid pandas)
grouped = df.groupby('key')
grouped.mean(ColumnA * ColumnB)
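
That exact syntax does not exist in pandas, but groupby(...).apply(...) evaluates a callable per group, after the split; a minimal sketch of the closest working equivalent, using the example data above:

import pandas as pd

df = pd.DataFrame({
    "key":     ["key1", "key2", "key1", "key3", "key2"],
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": [1, 2, 3, 4, 5],
})

# the product is computed inside each group, i.e. after the groupby
result = df.groupby("key").apply(lambda g: (g["ColumnA"] * g["ColumnB"]).mean())
print(result)  # one value per key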

Answer

OK, so I think the answer you are looking for is: we don't do that, because of vectorization in Python.

Consider the code below.

Now, in essence, Python is often optimized to execute certain mathematical operations in a vectorized way (take numpy or pandas, for instance) - which means applying an operation to the whole vector is faster than breaking it down into chunks and executing it piece by piece.

So e.g. df["A"].mul(df["B"]) will be faster than df.apply(lambda X: X["A"]*X["B"], axis=1). The same goes for grouping - it's just far more scalable that way.
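
A minimal timing sketch of that claim (the frame size and column names are illustrative, not from the original):

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 9, (100_000, 2)), columns=["A", "B"])

s = time.time()
vectorized = df["A"].mul(df["B"])  # one call over the whole column
t_vec = time.time() - s

s = time.time()
row_wise = df.apply(lambda r: r["A"] * r["B"], axis=1)  # Python-level loop over rows
t_row = time.time() - s

assert vectorized.equals(row_wise)
print(f"vectorized: {t_vec:.4f} s, row-wise apply: {t_row:.4f} s")

The exact numbers vary by machine, but the gap widens as the row count grows.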

Try the code below - it is in essence what you were referring to, i.e. doing the operation before vs. after groupby(...). The vectorized solution scales up pretty fast, even though it materializes an additional column - the more rows you process, the bigger the difference you will see.

Edit

I added a vectorized solution on the grouped data, so we have:

(1) we group, then evaluate lazily, row by row

(2) we process the full df in a vectorized way, then group and apply a built-in aggregating function

(3) we group, then process each group in a vectorized way, applying the aggregating function group by group

In essence, the results show that breaking the work into chunks slows the processing down, regardless of whether it is per group or per record - so the vectorized solution scales better than any kind of custom solution we can apply on top.

import pandas as pd
import numpy as np
import time

x = np.random.randint(1, 9, (3000, 5))
cols = [f"column{l}" for l in "ABCDE"]
df = pd.DataFrame(x, columns=cols)
df["cat"] = np.random.choice([f"key{l}" for l in "ABCDEFG"], size=3000)
df2 = df.copy()  # independent copies so the methods do not interfere
df3 = df.copy()
#print(df)

# method 1: group first, then evaluate the product lazily, row by row per group
s = time.time()
df.groupby("cat").apply(lambda z: np.prod(z[cols].values, axis=1).mean()).pipe(print)
e = time.time() - s
print(f"method 1: {e} s")

# method 2: materialize the combination column in a vectorized way, then group
s = time.time()
df2["prod"] = df2[cols].prod(axis=1)
df2.groupby("cat")["prod"].mean().pipe(print)
e = time.time() - s
print(f"method 2: {e} s")

# method 3: group first, then compute each group's product in a vectorized way
s = time.time()
result = [(key, np.prod(group[cols].values, axis=1).mean())
          for key, group in df3.groupby("cat")]
print(result)
e = time.time() - s
print(f"method 3: {e} s")
