How to groupby multiple columns and aggregate diff on different columns?


Question

I am looking for help on how to do this in Python / pandas:

I want to take the original data (below) and find the daily difference of multiple columns (cnt_a and cnt_b) within groups defined by multiple columns (state, county and date).

I've tried it several different ways, and I can't seem to get past the "check for duplicate" issue:

df.cnt_a = df.sort_values(['state','county','date']).groupby['state','county','date','cnt_a'].diff(-1)

I tried splitting it out to fix one thing at a time:

df1 = df.sort_values(['state','county','date'])

df2 = df1.groupby(['state','county'])['cnt_a'].diff()

Original data => df

date        county  state       cnt_a    cnt_b
2020-06-13  Bergen  New Jersey   308     11
2020-06-14  Bergen  New Jersey   308     11
2020-06-15  Bergen  New Jersey   320     15
2020-06-12  Union   New Jersey   100     3
2020-06-13  Union   New Jersey   130     4
2020-06-14  Union   New Jersey   150     5
2020-06-12  Bronx   New York     200     100
2020-06-13  Bronx   New York     210     200

Desired output

date        county  state       cnt_a   cnt_b   daydiff_a    daydiff_b
2020-06-13  Bergen  New Jersey   308     11        0            0 
2020-06-14  Bergen  New Jersey   308     11        0            0
2020-06-15  Bergen  New Jersey   320     15        12           4
2020-06-12  Union   New Jersey   100     3         0            0
2020-06-13  Union   New Jersey   130     4         30           1
2020-06-14  Union   New Jersey   150     5         20           1
2020-06-12  Bronx   New York     200     100       0            0 
2020-06-13  Bronx   New York     210     200       10           100

Recommended answer

  • Sort df first, by 'state', 'county', and 'date', in that order. .diff() subtracts the previous row within each group, so each group's rows must be in date order for the result to be a daily difference.
  • 'date' is used only for sorting; it is not one of the .groupby keys.
  • Specify rsuffix in .join, and/or use .rename, to change the headers of the new columns.
import pandas as pd

# setup the test dataframe
data = {'date': ['2020-06-13', '2020-06-14', '2020-06-15', '2020-06-12', '2020-06-13', '2020-06-14', '2020-06-12', '2020-06-13'],
        'county': ['Bergen', 'Bergen', 'Bergen', 'Union', 'Union', 'Union', 'Bronx', 'Bronx'],
        'state': ['New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New York', 'New York'],
        'cnt_a': [308, 308, 320, 100, 130, 150, 200, 210],
        'cnt_b': [11, 11, 15, 3, 4, 5, 100, 200]}

df = pd.DataFrame(data)

# set the date column to a datetime format
df.date = pd.to_datetime(df.date)

# sort the values
df = df.sort_values(['state', 'county', 'date'])

# groupby and join back to dataframe df
df = df.join(df.groupby(['state', 'county'])[['cnt_a', 'cnt_b']].diff().fillna(0), rsuffix='_diff')

# display(df)
        date  county       state  cnt_a  cnt_b  cnt_a_diff  cnt_b_diff
0 2020-06-13  Bergen  New Jersey    308     11         0.0         0.0
1 2020-06-14  Bergen  New Jersey    308     11         0.0         0.0
2 2020-06-15  Bergen  New Jersey    320     15        12.0         4.0
3 2020-06-12   Union  New Jersey    100      3         0.0         0.0
4 2020-06-13   Union  New Jersey    130      4        30.0         1.0
5 2020-06-14   Union  New Jersey    150      5        20.0         1.0
6 2020-06-12   Bronx    New York    200    100         0.0         0.0
7 2020-06-13   Bronx    New York    210    200        10.0       100.0
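
The desired output in the question uses integer columns named daydiff_a and daydiff_b, while the join shown here yields float columns with a _diff suffix. A minimal sketch of one way to match the requested layout, assuming the count columns contain no missing values (so casting back to int after fillna(0) is safe). The names cols, diffs, and result are only illustrative:

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    'date': ['2020-06-13', '2020-06-14', '2020-06-15', '2020-06-12',
             '2020-06-13', '2020-06-14', '2020-06-12', '2020-06-13'],
    'county': ['Bergen', 'Bergen', 'Bergen', 'Union',
               'Union', 'Union', 'Bronx', 'Bronx'],
    'state': ['New Jersey'] * 6 + ['New York'] * 2,
    'cnt_a': [308, 308, 320, 100, 130, 150, 200, 210],
    'cnt_b': [11, 11, 15, 3, 4, 5, 100, 200],
})
df['date'] = pd.to_datetime(df['date'])

# Rows must be in date order within each (state, county) group
df = df.sort_values(['state', 'county', 'date'])

cols = ['cnt_a', 'cnt_b']  # illustrative name for the columns to difference
diffs = (
    df.groupby(['state', 'county'])[cols]
      .diff()        # row minus previous row, per group
      .fillna(0)     # first row of each group has no previous row
      .astype(int)   # assumption: counts are whole numbers with no NaN
      .rename(columns={'cnt_a': 'daydiff_a', 'cnt_b': 'daydiff_b'})
)

# diffs keeps df's index, so join lines the rows up without a suffix
result = df.join(diffs)
print(result)
```

Renaming before the join avoids the column-name overlap, so rsuffix is not needed in this variant.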
          
