pandas and groupby: how to calculate weighted averages within an agg

Question

I calculate a number of aggregate functions using groupby and agg, because I need different aggregate functions for different variables, e.g. not the sum of everything, but the sum and mean of x, the mean of y, etc.

Is there a way to calculate a weighted average using agg? I have found lots of examples, but none that use agg.

I can calculate the weighted average manually, as in the code below (note the lines marked with **), but I was wondering whether there is a more elegant and direct way.

Can I create my own function and use that with agg?

To be clear, I fully understand that there are other solutions, e.g.

groupby weighted average and sum in pandas dataframe

Calculate weighted average using a pandas/dataframe

and lots, lots more. However, as I said, I am not sure how to implement these solutions with agg, and I need agg because I need to apply different aggregate functions to different columns (again, not the sum of everything, but the sum and mean of x, the mean of y, etc.).

Thanks!

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(5, 8, (1000, 4)), columns=['a', 'b', 'c', 'd'])
df['c * b'] = df['c'] * df['b']  # ** manual step: weight each c by b
g = df.groupby('a').agg(
        {'b': ['sum', lambda x: x.sum() / df['b'].sum(), 'mean'],  # lambda: group share of total b
         'c': ['sum', 'mean'],
         'd': ['sum'],
         'c * b': ['sum']})
g.columns = g.columns.map('_'.join)  # flatten the two-level column index
g['weighted average of c'] = g['c * b_sum'] / g['b_sum']  # ** manual step: sum(c * b) / sum(b)

Recommended answer

It is possible, but really complicated:

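import numpy as np
import pandas as pd

np.random.seed(234)
df = pd.DataFrame(np.random.randint(5, 8, (1000, 4)), columns=['a', 'b', 'c', 'd'])

# Average of c weighted by b: agg passes each group's b values as x,
# and x.index is used to look up the matching rows of c in the outer df.
wm = lambda x: (x * df.loc[x.index, "c"]).sum() / x.sum()
wm.__name__ = 'wa'  # agg uses __name__ as the column label

# Each group's share of the overall sum of b.
f = lambda x: x.sum() / df['b'].sum()
f.__name__ = '%'

g = df.groupby('a').agg(
        {'b': ['sum', f, 'mean', wm],
         'c': ['sum', 'mean'],
         'd': ['sum']})
g.columns = g.columns.map('_'.join)  # flatten the two-level column index
print(g)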

   d_sum  c_sum    c_mean  b_sum       b_%    b_mean      b_wa
a                                                             
5   2104   2062  5.976812   2067  0.344672  5.991304  5.969521
6   1859   1857  5.951923   1875  0.312656  6.009615  5.954667
7   2058   2084  6.075802   2055  0.342671  5.991254  6.085645
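
The same numbers can also be obtained without lambdas that read a global df. The following is only a sketch, assuming a pandas version with named aggregation (0.25 or later); the helper name weighted_mean_by and the output column names are made up for illustration and are not part of the original answer. It computes the weighted average with numpy.average and looks up the weights through the group's index, just as the wm lambda above does.

```python
import numpy as np
import pandas as pd

np.random.seed(234)
df = pd.DataFrame(np.random.randint(5, 8, (1000, 4)), columns=['a', 'b', 'c', 'd'])


def weighted_mean_by(weights):
    """Hypothetical helper: build an aggregator that averages a group's
    values using `weights`, a Series sharing the index of the grouped frame."""
    def wa(values):
        # values.index identifies the rows of the current group
        w = weights.loc[values.index].to_numpy()
        return np.average(values.to_numpy(), weights=w)
    return wa


# Named aggregation: output_name=(input_column, aggregation)
g = df.groupby('a').agg(
    b_sum=('b', 'sum'),
    b_mean=('b', 'mean'),
    c_sum=('c', 'sum'),
    c_mean=('c', 'mean'),
    c_wa=('c', weighted_mean_by(df['b'])),  # average of c weighted by b
    d_sum=('d', 'sum'),
)
print(g)
```

Like the lambda version, this relies on index alignment, so it assumes the index of df is unique; with duplicate index labels the weight lookup would return extra rows.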

Solution with apply:

def func(x):
    # x is the sub-DataFrame of one group; df is the full frame
    b1 = x['b'].sum()
    b2 = x['b'].sum() / df['b'].sum()              # group share of total b
    b3 = (x['b'] * x['c']).sum() / x['b'].sum()    # average of c weighted by b
    b4 = x['b'].mean()

    c1 = x['c'].sum()
    c2 = x['c'].mean()

    d1 = x['d'].sum()
    cols = ['b sum', 'b %', 'wa', 'b mean', 'c sum', 'c mean', 'd sum']
    return pd.Series([b1, b2, b3, b4, c1, c2, d1], index=cols)


g = df.groupby('a').apply(func)
print (g)
    b sum       b %        wa    b mean   c sum    c mean   d sum
a                                                                
5  2067.0  0.344672  5.969521  5.991304  2062.0  5.976812  2104.0
6  1875.0  0.312656  5.954667  6.009615  1857.0  5.951923  1859.0
7  2055.0  0.342671  6.085645  5.991254  2084.0  6.075802  2058.0

g.loc['total']=g.sum()
print (g)
        b sum       b %         wa     b mean   c sum     c mean   d sum
a                                                                       
5      2067.0  0.344672   5.969521   5.991304  2062.0   5.976812  2104.0
6      1875.0  0.312656   5.954667   6.009615  1857.0   5.951923  1859.0
7      2055.0  0.342671   6.085645   5.991254  2084.0   6.075802  2058.0
total  5997.0  1.000000  18.009832  17.992173  6003.0  18.004536  6021.0
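
Note that g.sum() also adds up the per-group averages, so the wa, b mean and c mean entries of the total row above (about 18) are sums of three averages, not overall averages. If an overall figure is wanted there, one option is to compute the total row from the whole frame with the same formulas. This is a sketch under that assumption; the function name summarize and its total_b parameter are hypothetical, and the groups are iterated explicitly instead of using apply.

```python
import numpy as np
import pandas as pd

np.random.seed(234)
df = pd.DataFrame(np.random.randint(5, 8, (1000, 4)), columns=['a', 'b', 'c', 'd'])


def summarize(x, total_b):
    """Hypothetical helper: the same statistics as func, for any frame x."""
    return pd.Series({
        'b sum': x['b'].sum(),
        'b %': x['b'].sum() / total_b,
        'wa': (x['b'] * x['c']).sum() / x['b'].sum(),  # average of c weighted by b
        'b mean': x['b'].mean(),
        'c sum': x['c'].sum(),
        'c mean': x['c'].mean(),
        'd sum': x['d'].sum(),
    })


total_b = df['b'].sum()

# One row of statistics per group, keyed by the group label
rows = {key: summarize(grp, total_b) for key, grp in df.groupby('a')}
g = pd.DataFrame(rows).T
g.index.name = 'a'

# Total row computed from the whole frame rather than by summing the rows
g.loc['total'] = summarize(df, total_b)
print(g)
```

With this version the sums in the total row match g.sum(), while wa, b mean and c mean stay on the same scale as the per-group rows.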
