pandas: groupby and variable weights

Views: 113 · Posted: 2018/5/30 14:03:17 · Tags: python, group-by, pandas, weighted-average


Problem description

I have a dataset with a weight for each observation, and I want to prepare weighted summaries using groupby, but I am rusty on how best to do this. I think it implies a custom aggregation function. My issue is how to work properly with group-wise rather than item-wise data. Perhaps it is best to do this in steps rather than in one go.


In pseudo-code, I am looking for
#first, calculate weighted value
for each row:
  weighted jobs = weight * jobs
#then, for each city, sum these weighted values and divide by the sum of weights
for each city:
  sum(weighted jobs)/sum(weight)
I am not sure how to work the "for each city"-part into a custom aggregate function and get access to group-level summaries.

Mock data:
import pandas as pd
import numpy as np
np.random.seed(43)

# prepare mock data
N = 100
industry = ['utilities', 'sales', 'real estate', 'finance']
city = ['sf', 'san mateo', 'oakland']
weight = np.random.randint(low=5, high=40, size=N)
jobs = np.random.randint(low=1, high=20, size=N)
ind = np.random.choice(industry, N)
cty = np.random.choice(city, N)
df_city = pd.DataFrame({'industry': ind, 'city': cty, 'weight': weight, 'jobs': jobs})

Solution
Simply multiply the two columns:
In [11]: df_city['weighted_jobs'] = df_city['weight'] * df_city['jobs']
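Now you can group by city and take the sum. Selecting the numeric columns explicitly keeps the string column industry out of the sum:
In [12]: df_city_sums = df_city.groupby('city')[['jobs', 'weight', 'weighted_jobs']].sum()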

In [13]: df_city_sums
Out[13]: 
           jobs  weight  weighted_jobs
city                                  
oakland     362     690           7958
san mateo   367    1017           9026
sf          253     638           6209

[3 rows x 3 columns]
Now you can divide the sum of weighted jobs by the sum of weights, as in the pseudo-code, to get the desired result:
In [14]: df_city_sums['weighted_jobs'] / df_city_sums['weight']
Out[14]: 
city
oakland      11.533333
san mateo     8.875123
sf            9.731975
dtype: float64
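
The question also asks about a custom aggregation function. As a rough sketch of a one-step alternative, the weighted mean can be computed per group with groupby plus apply and np.average. The helper name weighted_mean and the tiny hand-made frame below are assumptions made for illustration; they are not part of the original answer or its mock data.

```python
import numpy as np
import pandas as pd

# Small hand-made frame (hypothetical values, not the mock data above)
df = pd.DataFrame({
    "city": ["sf", "sf", "oakland", "oakland"],
    "weight": [10, 30, 5, 15],
    "jobs": [2, 4, 8, 12],
})


def weighted_mean(group, value_col="jobs", weight_col="weight"):
    """Return sum(value * weight) / sum(weight) for one group (hypothetical helper)."""
    return np.average(group[value_col], weights=group[weight_col])


# Select only the columns the helper needs, so the grouping column
# is not passed into the function.
one_step = df.groupby("city")[["jobs", "weight"]].apply(weighted_mean)

# The same quantity computed in steps, as in the answer above:
# sum of weight * jobs per city, divided by sum of weight per city.
two_step = (
    (df["weight"] * df["jobs"]).groupby(df["city"]).sum()
    / df.groupby("city")["weight"].sum()
)

print(one_step)
print(two_step)
```

Both results should agree. The step-wise version uses only vectorized sums and is likely to be faster on large frames, while the apply version keeps the weighting logic in one reusable function.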


                        