Getting the corresponding value of columnC while grouping by columnA and aggregating by columnB


Question

Building on this answer, and given

>>> df
  columnA  columnB  columnC
0    cat1        3      400
1    cat1        2       20
2    cat1        5     3029
3    cat2        1      492
4    cat2        4       30
5    cat3        2      203
6    cat3        6      402
7    cat3        4      391

>>> df.groupby(['columnA']).agg({'columnA':'size','columnB':'min'}).rename(columns={'columnA':'size'})

         size  min
columnA           
cat1        3    2
cat2        2    1
cat3        3    2

I want to obtain a DataFrame that also contains the value of columnC from the same row as the displayed minimum of columnB, that is:

         size  min  columnC
columnA           
cat1        3    2       20
cat2        2    1      492
cat3        3    2      203

Of course, this is possible only for aggregating functions (like min or max) that 'pick' a value from the group, not for those that 'aggregate' the group (like sum or average).

Any clue?

Thanks in advance.

Recommended answer

Since the result you are looking for is essentially a join on ['columnA', 'columnB'], you can obtain the desired DataFrame using

result = pd.merge(result, df, on=['columnA', 'columnB'], how='left')

provided we set up result with the right column names:

import pandas as pd

df = pd.DataFrame(
    {'columnA': ['cat1', 'cat1', 'cat1', 'cat2', 'cat2', 'cat3', 'cat3', 'cat3'],
     'columnB': [3, 2, 5, 1, 4, 2, 6, 4],
     'columnC': [400, 20, 3029, 492, 30, 203, 402, 391]})

result = df.groupby('columnA').agg({'columnA':'size', 'columnB':'min'})
result = result.rename(columns={'columnA':'size'})
result = result.reset_index()
result = pd.merge(result, df, on=['columnA', 'columnB'], how='left')
result = result.set_index('columnA')
result = result.rename(columns={'columnB':'min'})
print(result)

yields

         min  size  columnC
columnA                    
cat1       2     3       20
cat2       1     2      492
cat3       2     3      203
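
One thing to keep in mind with the merge: it matches on the value of columnB, so a group whose minimum is tied across several rows comes back with one row per tie. The sketch below is not from the original answer. It uses made-up data with a tie in cat1, builds result with named aggregation (this assumes pandas 0.25 or later), and then keeps a single row per group with drop_duplicates, which is only one possible tie-breaking policy.

```python
import pandas as pd

# Hypothetical data: cat1 has two rows tied at the minimum columnB == 2.
df = pd.DataFrame(
    {'columnA': ['cat1', 'cat1', 'cat1', 'cat2'],
     'columnB': [2, 2, 5, 1],
     'columnC': [20, 99, 3029, 492]})

# Named aggregation (assumed: pandas >= 0.25) computes the group size and the
# minimum without aggregating the grouping column itself.
result = (df.groupby('columnA')
            .agg(size=('columnB', 'size'), columnB=('columnB', 'min'))
            .reset_index())

# The left merge returns every row of df that matches (columnA, min columnB),
# so cat1 appears twice here.
merged = pd.merge(result, df, on=['columnA', 'columnB'], how='left')

# Keep one row per group: whichever match the merge returned first.
deduped = (merged.drop_duplicates(subset='columnA')
                 .set_index('columnA')
                 .rename(columns={'columnB': 'min'}))
print(deduped)
```

If every tied row should be kept, skip the drop_duplicates step and work with merged directly.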






One reason you might want to use pd.merge instead of groupby/apply is that groupby/apply calls a Python function once for each group. If there are a lot of groups, this can be slow.

For example, if you had a 10000-row DataFrame with 1000 groups,

import numpy as np
import pandas as pd

N = 10000
df = pd.DataFrame(
    {'columnA': np.random.choice(['cat{}'.format(i) for i in range(N//10)], 
                                 size=N),
     'columnB': np.random.randint(10, size=N),
     'columnC': np.random.randint(100, size=N)})

then using_merge (below) is over 250x faster than using_apply:

def using_merge(df):
    result = df.groupby('columnA').agg({'columnA':'size', 'columnB':'min'})
    result = result.rename(columns={'columnA':'size'})
    result = result.reset_index()
    result = pd.merge(result, df, on=['columnA', 'columnB'], how='left')
    result = result.set_index('columnA')
    result = result.rename(columns={'columnB':'min'})
    return result

def using_apply(df):
    return (df.groupby("columnA")
            .apply(lambda g: (g[g.columnB == g.columnB.min()]
                   .assign(size = g.columnA.size)
                   .rename(columns={'columnB': 'min'})
                   .drop('columnA', axis=1)))
            .reset_index(level=1, drop=True))







In [80]: %timeit using_merge(df)
100 loops, best of 3: 7.99 ms per loop

In [81]: %timeit using_apply(df)
1 loop, best of 3: 2.06 s per loop

In [82]: 2060/7.99
Out[82]: 257.8222778473091
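
Another way to avoid a per-group Python function is to look up the row label of each group's minimum with idxmin and select those rows directly. This is a sketch and not part of the original answer: the name using_idxmin is hypothetical, it assumes df has a unique index, and it has not been timed against using_merge. Unlike the merge, it always returns exactly one row per group (the first row holding the minimum).

```python
import pandas as pd

df = pd.DataFrame(
    {'columnA': ['cat1', 'cat1', 'cat1', 'cat2', 'cat2', 'cat3', 'cat3', 'cat3'],
     'columnB': [3, 2, 5, 1, 4, 2, 6, 4],
     'columnC': [400, 20, 3029, 492, 30, 203, 402, 391]})

def using_idxmin(df):
    # Hypothetical helper; assumes the index of df has no duplicate labels.
    grouped = df.groupby('columnA')['columnB']
    # idxmin gives, per group, the index label of the first row with the minimum.
    picked = df.loc[grouped.idxmin()].set_index('columnA')
    picked = picked.rename(columns={'columnB': 'min'})
    # Group sizes are indexed by columnA, so they align with picked on assignment.
    picked['size'] = grouped.size()
    return picked[['size', 'min', 'columnC']]

result = using_idxmin(df)
print(result)
```

On the example data from the question this gives size, min and columnC of 3, 2, 20 for cat1; 2, 1, 492 for cat2; and 3, 2, 203 for cat3.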
