pandas: groupby and variable weights
Problem description
I have a dataset with weights for each observation and I want to prepare weighted summaries using groupby, but I am rusty as to how best to do this. I think it implies a custom aggregation function. My issue is how to properly deal with group-wise rather than item-wise data. Perhaps it is best to do this in steps rather than in one go.
In pseudo-code, I am looking for:
# first, calculate the weighted value for each row:
weighted jobs = weight * jobs
# then, for each city, sum these weighted values and divide by the count (sum of weights) for that city:
sum(weighted jobs) / sum(weight)
I am not sure how to work the "for each city"-part into a custom aggregate function and get access to group-level summaries.
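To make the pseudo-code above concrete, here is a tiny hand-checkable case for a single city. The two weights and job counts are made-up values for illustration, not part of the mock data.

```python
# Hypothetical toy values for a single city (not taken from the mock data).
weights = [10, 30]
jobs = [4, 8]

# Step 1: weighted value for each row.
weighted_jobs = [w * j for w, j in zip(weights, jobs)]  # [40, 240]

# Step 2: sum of weighted values divided by the sum of weights.
weighted_mean = sum(weighted_jobs) / sum(weights)  # 280 / 40

print(weighted_mean)  # 7.0
```

The row with the larger weight pulls the result toward its value (8), which is the behaviour wanted per city.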
Mock data:
import pandas as pd
import numpy as np
np.random.seed(43)
## prep mock data
N = 100
industry = ['utilities', 'sales', 'real estate', 'finance']
city = ['sf', 'san mateo', 'oakland']
weight = np.random.randint(low=5, high=40, size=N)
jobs = np.random.randint(low=1, high=20, size=N)
ind = np.random.choice(industry, N)
cty = np.random.choice(city, N)
df_city = pd.DataFrame({'industry': ind, 'city': cty, 'weight': weight, 'jobs': jobs})
Solution
Simply multiply the two columns to get the weighted value for each row:
In [11]: df_city['weighted_jobs'] = df_city['weight'] * df_city['jobs']
Now you can group by city and take the sum:
In [12]: df_city_sums = df_city.groupby('city').sum()
In [13]: df_city_sums
Out[13]:
           jobs  weight  weighted_jobs
city
oakland     362     690           7958
san mateo   367    1017           9026
sf          253     638           6209

[3 rows x 3 columns]
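Depending on the pandas version, calling sum() on the whole grouped frame may either drop the string column industry or try to aggregate it as well. A minimal sketch, assuming only the numeric columns are wanted, is to select them explicitly before summing. The small frame below is made up for illustration and is not the mock data above.

```python
import pandas as pd

# Hypothetical miniature frame with the same columns as df_city.
df = pd.DataFrame({
    'industry': ['sales', 'finance', 'sales', 'utilities'],
    'city': ['sf', 'sf', 'oakland', 'oakland'],
    'weight': [10, 30, 5, 15],
    'jobs': [4, 8, 2, 6],
})
df['weighted_jobs'] = df['weight'] * df['jobs']

# Select only the columns to be summed, so the result does not depend on
# how the installed pandas version treats non-numeric columns.
cols = ['jobs', 'weight', 'weighted_jobs']
sums = df.groupby('city')[cols].sum()
print(sums)
```

The result has one row per city and exactly the three selected columns.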
Now you can divide the sum of weighted jobs by the sum of weights to get the desired result:
In [14]: df_city_sums['weighted_jobs'] / df_city_sums['weight']
Out[14]:
city
oakland      11.533333
san mateo     8.875123
sf            9.731975
dtype: float64
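If a single step is preferred, the weighted mean of jobs per city can also be sketched with groupby(...).apply and np.average, which accepts a weights argument. The helper name weighted_mean and the toy frame are assumptions for illustration. On large data the multi-step version tends to be faster, because apply calls a Python function once per group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the mock data from the question).
df = pd.DataFrame({
    'city': ['sf', 'sf', 'oakland', 'oakland'],
    'weight': [10, 30, 5, 15],
    'jobs': [4, 8, 2, 6],
})

def weighted_mean(group):
    # np.average computes sum(weights * values) / sum(weights).
    return np.average(group['jobs'], weights=group['weight'])

# Selecting the two columns before apply keeps the grouping key out of `group`.
result = df.groupby('city')[['jobs', 'weight']].apply(weighted_mean)
print(result)
```

The result is a Series indexed by city, matching what the multiply, sum, and divide steps give for the same data.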