How to groupby multiple columns and aggregate diff on different columns?
Problem description
I am looking for help on how to do this in Python / pandas:
I am looking to take the original data (below) and find the daily difference of multiple columns (cnt_a and cnt_b) by a group with multiple columns (state, county and date).
I've been trying it different ways, and I can't seem to get past the "check for duplicate" issue:
df.cnt_a = df.sort_values(['state','county','date']).groupby['state','county','date','cnt_a'].diff(-1)
Tried splitting it out to fix one thing at a time:
df1 = df.sort_values(['state','county','date'])
df2 = df1.groupby(['state','county'])['cnt_a'].diff()
Original data => df
date        county  state       cnt_a  cnt_b
2020-06-13  Bergen  New Jersey    308     11
2020-06-14  Bergen  New Jersey    308     11
2020-06-15  Bergen  New Jersey    320     15
2020-06-12  Union   New Jersey    100      3
2020-06-13  Union   New Jersey    130      4
2020-06-14  Union   New Jersey    150      5
2020-06-12  Bronx   New York      200    100
2020-06-13  Bronx   New York      210    200
Desired output
date        county  state       cnt_a  cnt_b  daydiff_a  daydiff_b
2020-06-13  Bergen  New Jersey    308     11          0          0
2020-06-14  Bergen  New Jersey    308     11          0          0
2020-06-15  Bergen  New Jersey    320     15         12          4
2020-06-12  Union   New Jersey    100      3          0          0
2020-06-13  Union   New Jersey    130      4         30          1
2020-06-14  Union   New Jersey    150      5         20          1
2020-06-12  Bronx   New York      200    100          0          0
2020-06-13  Bronx   New York      210    200         10        100
Recommended answer
- It's important to sort df before grouping: .diff() subtracts the previous row of the same group from each row, so the rows of every group must already be in date order.
  - Be certain to sort df by 'state', 'county', and 'date', in that order. The 'date' column is used only for sorting and is left out of .groupby.
- Specify rsuffix, and/or use .rename, to change the column headers.
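The sorting point above can be checked on a small sample. The sketch below is only an illustration: the three-row frame named demo is hypothetical (it reuses the Union county values from the question, shuffled out of date order), and it compares the grouped diff before and after sorting.

```python
import pandas as pd

# Hypothetical sample: the Union rows from the question, deliberately out of date order
demo = pd.DataFrame({
    'date': pd.to_datetime(['2020-06-14', '2020-06-12', '2020-06-13']),
    'county': ['Union', 'Union', 'Union'],
    'state': ['New Jersey', 'New Jersey', 'New Jersey'],
    'cnt_a': [150, 100, 130],
})

# diff() follows the current row order, so unsorted rows give misleading values
unsorted_diff = demo.groupby(['state', 'county'])['cnt_a'].diff()

# after sorting by the group keys and then date, each row is compared with the previous day
sorted_demo = demo.sort_values(['state', 'county', 'date'])
sorted_diff = sorted_demo.groupby(['state', 'county'])['cnt_a'].diff()

print(unsorted_diff.tolist())  # [nan, -50.0, 30.0]
print(sorted_diff.tolist())    # [nan, 30.0, 20.0]
```

The first row of each group has no previous row, so its diff is NaN in both cases; that is what fillna(0) replaces in the answer.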
import pandas as pd

# set up the test dataframe
data = {'date': ['2020-06-13', '2020-06-14', '2020-06-15', '2020-06-12', '2020-06-13', '2020-06-14', '2020-06-12', '2020-06-13'],
        'county': ['Bergen', 'Bergen', 'Bergen', 'Union', 'Union', 'Union', 'Bronx', 'Bronx'],
        'state': ['New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New York', 'New York'],
        'cnt_a': [308, 308, 320, 100, 130, 150, 200, 210],
        'cnt_b': [11, 11, 15, 3, 4, 5, 100, 200]}
df = pd.DataFrame(data)

# set the date column to a datetime format
df.date = pd.to_datetime(df.date)

# sort the values
df = df.sort_values(['state', 'county', 'date'])

# groupby and join back to dataframe df
df = df.join(df.groupby(['state', 'county'])[['cnt_a', 'cnt_b']].diff().fillna(0), rsuffix='_diff')

# display(df)
        date  county       state  cnt_a  cnt_b  cnt_a_diff  cnt_b_diff
0 2020-06-13  Bergen  New Jersey    308     11         0.0         0.0
1 2020-06-14  Bergen  New Jersey    308     11         0.0         0.0
2 2020-06-15  Bergen  New Jersey    320     15        12.0         4.0
3 2020-06-12   Union  New Jersey    100      3         0.0         0.0
4 2020-06-13   Union  New Jersey    130      4        30.0         1.0
5 2020-06-14   Union  New Jersey    150      5        20.0         1.0
6 2020-06-12   Bronx    New York    200    100         0.0         0.0
7 2020-06-13   Bronx    New York    210    200        10.0       100.0
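If the exact headers and integer values from the desired output are wanted, one possible variant is to rename the diff columns before joining, as the .rename suggestion in the answer hints. This is a sketch rather than the answer's own code: the names diffs and result are introduced here for illustration, and the cast to int assumes the counts are whole numbers with no other missing values.

```python
import pandas as pd

# same test data as in the question
data = {'date': ['2020-06-13', '2020-06-14', '2020-06-15', '2020-06-12',
                 '2020-06-13', '2020-06-14', '2020-06-12', '2020-06-13'],
        'county': ['Bergen', 'Bergen', 'Bergen', 'Union', 'Union', 'Union', 'Bronx', 'Bronx'],
        'state': ['New Jersey'] * 6 + ['New York'] * 2,
        'cnt_a': [308, 308, 320, 100, 130, 150, 200, 210],
        'cnt_b': [11, 11, 15, 3, 4, 5, 100, 200]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['state', 'county', 'date'])

# per-group day-over-day difference; the first row of each group becomes 0
diffs = (
    df.groupby(['state', 'county'])[['cnt_a', 'cnt_b']]
      .diff()
      .fillna(0)
      .astype(int)  # assumption: counts are whole numbers
      .rename(columns={'cnt_a': 'daydiff_a', 'cnt_b': 'daydiff_b'})
)

# the renamed columns no longer clash with df, so no suffix is needed
result = df.join(diffs)
print(result)
```

Because the join aligns on the index, the new columns line up with the original rows regardless of how the frame was sorted.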