pandas and groupby: how to calculate weighted averages within an agg
Question
I calculate a number of aggregate functions using groupby and agg, because I need different aggregate functions for different variables, e.g. not the sum of everything, but the sum and mean of x, the mean of y, etc.
Is there a way to calculate a weighted average using agg? I have found lots of examples, but none that use agg.
I can calculate the weighted average manually, as in the code below (note the lines marked with **), but I was wondering whether there is a more elegant and direct way.
Can I create my own function and use it with agg?
To be clear, I fully understand that there are other solutions, and lots, lots of them. However, as I said, I am not sure how to implement them with agg, and I need agg because I need to apply different aggregate functions to different columns (again, not the sum of everything, but the sum and mean of x, the mean of y, etc.).
Thanks!
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(5, 8, (1000, 4)), columns=['a', 'b', 'c', 'd'])
df['c * b'] = df['c'] * df['b']  # ** manual step: value times weight
g = df.groupby('a').agg(
    {'b': ['sum', lambda x: x.sum() / df['b'].sum(), 'mean'],
     'c': ['sum', 'mean'],
     'd': ['sum'],
     'c * b': ['sum']})
g.columns = g.columns.map('_'.join)
g['weighted average of c'] = g['c * b_sum'] / g['b_sum']  # ** manual step: divide by the total weight
Recommended answer
It is possible, but really complicated:
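import numpy as np
import pandas as pd

np.random.seed(234)
df = pd.DataFrame(np.random.randint(5, 8, (1000, 4)), columns=['a', 'b', 'c', 'd'])

# x is the group's b column; look up the matching c values by index,
# so this is the average of c weighted by b
wm = lambda x: (x * df.loc[x.index, "c"]).sum() / x.sum()
wm.__name__ = 'wa'

# share of the group's b in the overall total of b
f = lambda x: x.sum() / df['b'].sum()
f.__name__ = '%'

g = df.groupby('a').agg(
    {'b': ['sum', f, 'mean', wm],
     'c': ['sum', 'mean'],
     'd': ['sum']})
g.columns = g.columns.map('_'.join)
print(g)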
   d_sum  c_sum    c_mean  b_sum       b_%    b_mean      b_wa
a
5   2104   2062  5.976812   2067  0.344672  5.991304  5.969521
6   1859   1857  5.951923   1875  0.312656  6.009615  5.954667
7   2058   2084  6.075802   2055  0.342671  5.991254  6.085645
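Note that the b_wa column above is the average of c weighted by b; it is listed under b only because the lambda is attached to that column. On pandas 0.25 or later, the same idea could probably be written with named aggregation, which lets the output column be named directly. The following is a minimal sketch under that assumption; `weighted_by` is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

np.random.seed(234)
df = pd.DataFrame(np.random.randint(5, 8, (1000, 4)), columns=['a', 'b', 'c', 'd'])


def weighted_by(weights):
    """Hypothetical helper: build an aggregator that averages a column using `weights`."""
    def agg_func(values):
        # Each group arrives as a Series that keeps its original index,
        # so the matching weights can be looked up by label.
        w = weights.loc[values.index]
        return (values * w).sum() / w.sum()
    return agg_func


# Named aggregation: output_name=(input_column, function)
g = df.groupby('a').agg(
    b_sum=('b', 'sum'),
    b_mean=('b', 'mean'),
    c_sum=('c', 'sum'),
    c_mean=('c', 'mean'),
    c_wa=('c', weighted_by(df['b'])),
    d_sum=('d', 'sum'),
)
print(g)
```

The closure captures the weight column once, so the helper can be reused for any value column without referring to a global df inside the aggregator.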
Solution with apply:
def func(x):
    # print (x)
    b1 = x['b'].sum()
    b2 = x['b'].sum() / df['b'].sum()
    b3 = (x['b'] * x['c']).sum() / x['b'].sum()
    b4 = x['b'].mean()
    c1 = x['c'].sum()
    c2 = x['c'].mean()
    d1 = x['d'].sum()
    cols = ['b sum', 'b %', 'wa', 'b mean', 'c sum', 'c mean', 'd sum']
    return pd.Series([b1, b2, b3, b4, c1, c2, d1], index=cols)

g = df.groupby('a').apply(func)
print(g)
    b sum       b %        wa    b mean   c sum    c mean   d sum
a
5  2067.0  0.344672  5.969521  5.991304  2062.0  5.976812  2104.0
6  1875.0  0.312656  5.954667  6.009615  1857.0  5.951923  1859.0
7  2055.0  0.342671  6.085645  5.991254  2084.0  6.075802  2058.0
g.loc['total'] = g.sum()
print(g)
        b sum       b %         wa     b mean   c sum     c mean   d sum
a
5      2067.0  0.344672   5.969521   5.991304  2062.0   5.976812  2104.0
6      1875.0  0.312656   5.954667   6.009615  1857.0   5.951923  1859.0
7      2055.0  0.342671   6.085645   5.991254  2084.0   6.075802  2058.0
total  5997.0  1.000000  18.009832  17.992173  6003.0  18.004536  6021.0
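One caveat about the total row: g.sum() adds up every column, including the ratio columns (wa, b mean, c mean), which is why they come out near 18 instead of near 6. If an overall average is what is wanted, one option is to apply the same summary function to the whole frame. This is only a sketch: `summarize` is a hypothetical name, np.average is swapped in for the manual formula, and the columns are selected before apply so that the grouping column is not passed to the function.

```python
import numpy as np
import pandas as pd

np.random.seed(234)
df = pd.DataFrame(np.random.randint(5, 8, (1000, 4)), columns=['a', 'b', 'c', 'd'])


def summarize(x):
    """Hypothetical summary: the same statistics as func, for any slice of df."""
    return pd.Series({
        'b sum': x['b'].sum(),
        'b %': x['b'].sum() / df['b'].sum(),
        'wa': np.average(x['c'], weights=x['b']),  # average of c weighted by b
        'b mean': x['b'].mean(),
        'c sum': x['c'].sum(),
        'c mean': x['c'].mean(),
        'd sum': x['d'].sum(),
    })


# Per-group statistics
g = df.groupby('a')[['b', 'c', 'd']].apply(summarize)
# Total row recomputed from all rows instead of summing the group rows
g.loc['total'] = summarize(df)
print(g)
```

With this variant the sum columns in the total row still equal the sum of the group rows, while wa and the mean columns are true overall averages.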