Calculate weighted average with pandas dataframe
Question
I have the following pandas dataframe:
import pandas as pd

data_df = pd.DataFrame({'ind': ['la', 'p', 'la', 'la', 'p', 'g', 'g', 'la'],
                        'dist': [10., 5., 7., 8., 7., 2., 5., 3.],
                        'diff': [0.54, 3.2, 8.6, 7.2, 2.1, 1., 3.5, 4.5],
                        'cas': [1., 2., 3., 4., 5., 6., 7., 8.]})
That is:
cas diff dist ind
0 1 0.54 10 la
1 2 3.20 5 p
2 3 8.60 7 la
3 4 7.20 8 la
4 5 2.10 7 p
5 6 1.00 2 g
6 7 3.50 5 g
7 8 4.50 3 la
I need to compute the weighted average of every column, using the values in the 'dist' column as weights, with the rows grouped by 'ind'.
For example, for 'ind' = 'la' and the 'diff' column:
((10*0.54)+(8.60*7)+(7.20*8)+(4.50*3))/(10+7+8+3) = 4.882143
The result I want to obtain is the following:
cas diff
ind
g 6.714286 2.785714
la 3.107143 4.882143
p 3.750000 2.558333
This is obtained by multiplying each value of each column by the corresponding value in the 'dist' column, summing the products that share the same 'ind', and then dividing that sum by the sum of all the 'dist' values for the same 'ind'.
I thought this would be an easy task for the dataframe 'groupby' method, but it is actually rather tricky.
Can anyone help me?
Recommended answer
You can obtain weights normalized within each group by using transform (here df is the question's data_df):
>>> df['weight'] = df['dist'] / df.groupby('ind')['dist'].transform('sum')
>>> df['weight']
0 0.357143
1 0.416667
2 0.250000
3 0.285714
4 0.583333
5 0.285714
6 0.714286
7 0.107143
Name: weight, dtype: float64
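As a quick sanity check on the normalization step above, the weights should add up to 1 within each 'ind' group. The sketch below rebuilds the question's data and checks this; the names weight and group_totals are introduced here only for illustration.

```python
import pandas as pd

# The data from the question.
df = pd.DataFrame({
    'ind': ['la', 'p', 'la', 'la', 'p', 'g', 'g', 'la'],
    'dist': [10., 5., 7., 8., 7., 2., 5., 3.],
    'diff': [0.54, 3.2, 8.6, 7.2, 2.1, 1., 3.5, 4.5],
    'cas': [1., 2., 3., 4., 5., 6., 7., 8.],
})

# Each row's share of its group's total 'dist'.
# transform('sum') returns a Series aligned with the original rows.
weight = df['dist'] / df.groupby('ind')['dist'].transform('sum')

# Summing the shares per group should give 1.0 (up to floating-point rounding).
group_totals = weight.groupby(df['ind']).sum()
print(group_totals)
```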
Then you just need to multiply the values by these weights and take the sum within each group:
>>> df['wcas'], df['wdiff'] = (df[n] * df['weight'] for n in ('cas', 'diff'))
>>> df.groupby('ind')[['wcas', 'wdiff']].sum()
wcas wdiff
ind
g 6.714286 2.785714
la 3.107143 4.882143
p 3.750000 2.558333
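If you would rather not add helper columns to the dataframe, the same multiply, sum, and divide recipe can be written without them. This is only a sketch of one possible variant, not part of the original answer; value_cols, weighted_sums, dist_sums, and result are names made up for the example.

```python
import pandas as pd

# The data from the question.
df = pd.DataFrame({
    'ind': ['la', 'p', 'la', 'la', 'p', 'g', 'g', 'la'],
    'dist': [10., 5., 7., 8., 7., 2., 5., 3.],
    'diff': [0.54, 3.2, 8.6, 7.2, 2.1, 1., 3.5, 4.5],
    'cas': [1., 2., 3., 4., 5., 6., 7., 8.],
})

value_cols = ['cas', 'diff']

# Multiply every value by its row's 'dist', then sum the products per group.
weighted_sums = df[value_cols].mul(df['dist'], axis=0).groupby(df['ind']).sum()

# Total weight per group.
dist_sums = df.groupby('ind')['dist'].sum()

# Weighted average = sum(value * weight) / sum(weight), aligned on the 'ind' index.
result = weighted_sums.div(dist_sums, axis=0)
print(result)
```

The original df is left untouched, so no backup copy is needed.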
Or, with in-place mutation:
>>> backup = df.copy()  # keep a backup copy, since df is mutated in place below
>>> cols = ['cas', 'diff']  # name the value columns explicitly; column order depends on the pandas version
>>> df[cols] = df['weight'].values[:, None] * df[cols]
>>> df.groupby('ind')[cols].sum()
cas diff
ind
g 6.714286 2.785714
la 3.107143 4.882143
p 3.750000 2.558333
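To cross-check the numbers above, one option is to let numpy.average do the weighting for each group separately. This is a sketch that assumes NumPy is available alongside pandas; it is slower than the vectorized approach on large data but easy to read. The names rows and result are illustrative only.

```python
import numpy as np
import pandas as pd

# The data from the question.
df = pd.DataFrame({
    'ind': ['la', 'p', 'la', 'la', 'p', 'g', 'g', 'la'],
    'dist': [10., 5., 7., 8., 7., 2., 5., 3.],
    'diff': [0.54, 3.2, 8.6, 7.2, 2.1, 1., 3.5, 4.5],
    'cas': [1., 2., 3., 4., 5., 6., 7., 8.],
})

rows = {}
for key, group in df.groupby('ind'):
    # Weighted average down the rows (axis=0) of the two value columns,
    # using the group's 'dist' values as weights.
    rows[key] = np.average(
        group[['cas', 'diff']].to_numpy(),
        axis=0,
        weights=group['dist'].to_numpy(),
    )

# One row per group, one column per averaged value.
result = pd.DataFrame.from_dict(rows, orient='index', columns=['cas', 'diff'])
result.index.name = 'ind'
print(result)
```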