Calculate weighted average with pandas dataframe


Problem description

I have the following pandas dataframe:

import pandas as pd

data_df = pd.DataFrame({'ind':['la','p','la','la','p','g','g','la'],
                        'dist':[10.,5.,7.,8.,7.,2.,5.,3.],
                        'diff':[0.54,3.2,8.6,7.2,2.1,1.,3.5,4.5],
                        'cas':[1.,2.,3.,4.,5.,6.,7.,8.]})  

that is:

    cas  diff  dist ind
0    1  0.54    10  la
1    2  3.20     5   p
2    3  8.60     7  la
3    4  7.20     8  la
4    5  2.10     7   p
5    6  1.00     2   g
6    7  3.50     5   g
7    8  4.50     3  la

I need to compute the weighted average of all the columns, where the weights are in the 'dist' column, and group the values by 'ind'.

For example, for 'ind' = 'la' and the 'diff' column:

((10*0.54)+(8.60*7)+(7.20*8)+(4.50*3))/(10+7+8+3) = 4.882143
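The arithmetic above can be checked with a small sketch. It assumes NumPy is available and copies the four 'la' rows by hand; the names `diff_la` and `dist_la` are made up for illustration.

```python
import numpy as np

# 'diff' values and 'dist' weights for the rows where ind == 'la',
# copied by hand from the table above (illustrative names).
diff_la = np.array([0.54, 8.60, 7.20, 4.50])
dist_la = np.array([10.0, 7.0, 8.0, 3.0])

# np.average with weights computes sum(w * x) / sum(w).
wavg = np.average(diff_la, weights=dist_la)
print(round(wavg, 6))  # 4.882143
```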

The result I would like to obtain is the following:

        cas      diff
ind                    
 g    6.714286  2.785714
 la   3.107143  4.882143
 p    3.750000  2.558333

This is obtained by multiplying each value of each column by the corresponding value in the 'dist' column, summing the results that share the same 'ind', and then dividing by the sum of all the 'dist' values corresponding to the same 'ind'.

I thought this would be an easy task for the dataframe 'groupby' method, but it is actually kind of tricky.

Can anyone help me?

Recommended answer

You can obtain weights normalized within each group by using transform (df below is the data_df from the question):

>>> df['weight'] = df['dist'] / df.groupby('ind')['dist'].transform('sum')
>>> df['weight']
0    0.357143
1    0.416667
2    0.250000
3    0.285714
4    0.583333
5    0.285714
6    0.714286
7    0.107143
Name: weight, dtype: float64

Then you just need to multiply the values by these weights and take the sum within each group:

>>> df['wcas'], df['wdiff'] = (df[n] * df['weight'] for n in ('cas', 'diff'))
>>> df.groupby('ind')[['wcas', 'wdiff']].sum()
         wcas     wdiff
ind                    
g    6.714286  2.785714
la   3.107143  4.882143
p    3.750000  2.558333
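The two steps above can also be wrapped so that the original dataframe is left untouched. This is only a sketch: the helper name `weighted_group_mean` and its parameters are hypothetical, not part of pandas or of the original answer.

```python
import pandas as pd

def weighted_group_mean(df, value_cols, weight_col, group_col):
    """Hypothetical helper: per-group weighted mean of value_cols."""
    # Normalize the weights so that they sum to 1 within each group.
    weights = df[weight_col] / df.groupby(group_col)[weight_col].transform('sum')
    # Scale every value column row-wise by its weight (df itself is not modified).
    weighted = df[value_cols].mul(weights, axis=0)
    # Summing the scaled values within each group gives the weighted mean.
    return weighted.groupby(df[group_col]).sum()

data_df = pd.DataFrame({'ind': ['la', 'p', 'la', 'la', 'p', 'g', 'g', 'la'],
                        'dist': [10., 5., 7., 8., 7., 2., 5., 3.],
                        'diff': [0.54, 3.2, 8.6, 7.2, 2.1, 1., 3.5, 4.5],
                        'cas': [1., 2., 3., 4., 5., 6., 7., 8.]})

result = weighted_group_mean(data_df, ['cas', 'diff'], 'dist', 'ind')
print(result.round(6))
```

Because the weights are kept in a local variable, no 'weight', 'wcas' or 'wdiff' columns are added to the input.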


Alternatively, with in-place mutation:

>>> backup = df.copy()      # keep a backup, since the next step overwrites columns in place
>>> cols = ['cas', 'diff']  # name the columns explicitly; df.columns[:2] depends on column order
>>> df[cols] = df['weight'].values[:, None] * df[cols]
>>> df.groupby('ind')[cols].sum()
          cas      diff
ind                    
g    6.714286  2.785714
la   3.107143  4.882143
p    3.750000  2.558333
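For comparison, a different route (not the one used in the answer above) is to let NumPy compute the weighted mean inside each group. The sketch below assumes NumPy is installed; the function name `group_wavg` is hypothetical.

```python
import numpy as np
import pandas as pd

data_df = pd.DataFrame({'ind': ['la', 'p', 'la', 'la', 'p', 'g', 'g', 'la'],
                        'dist': [10., 5., 7., 8., 7., 2., 5., 3.],
                        'diff': [0.54, 3.2, 8.6, 7.2, 2.1, 1., 3.5, 4.5],
                        'cas': [1., 2., 3., 4., 5., 6., 7., 8.]})

value_cols = ['cas', 'diff']

def group_wavg(group):
    # Hypothetical helper: weighted mean of each value column in one group,
    # using the group's 'dist' values as weights.
    means = np.average(group[value_cols].to_numpy(),
                       weights=group['dist'].to_numpy(), axis=0)
    return pd.Series(means, index=value_cols)

# Select only the needed columns so the grouping key is not passed to the helper.
result = data_df.groupby('ind')[value_cols + ['dist']].apply(group_wavg)
print(result.round(6))
```

This is shorter to read, but it calls a Python function once per group, so the transform-based version may be preferable when there are many groups.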
