Combine Date Ranges in Pandas Dataframe


Problem description

I have a set of records in Python with an id, at least one attribute, and a set of date ranges. I would like code that takes each id, and combines all the records where the attributes match and there is no gap in the date range.

By no gap in date range, I mean that the end date of one record is greater than or equal to the start date of the next record for that id.

For instance, a record with id "10", start date "2016-01-01" and end date "2017-01-01" could be merged with another record with that id, a start date of "2017-01-01", and an end date of "2018-01-01", but it could NOT be merged with a record that started on "2017-01-10", because there'd be a gap from 2017-01-01 to 2017-01-09.
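The merge condition described above can be sketched as a small predicate. This is only an illustration: the helper name `can_merge` is hypothetical and does not appear in the original code.

```python
import pandas as pd

def can_merge(prev_end, next_start):
    """Return True when the next record starts on or before the previous record's end date."""
    return pd.Timestamp(next_start) <= pd.Timestamp(prev_end)

# A record ending 2017-01-01 followed by one starting the same day: no gap.
touching = can_merge("2017-01-01", "2017-01-01")
# A record ending 2017-01-01 followed by one starting 2017-01-10: there is a gap.
gapped = can_merge("2017-01-01", "2017-01-10")
print(touching, gapped)
```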

Here are some examples:

Have:

FruitID,FruitType,StartDate,EndDate
1,Apple,2015-01-01,2016-01-01
1,Apple,2016-01-01,2017-01-01
1,Apple,2017-01-01,2018-01-01
2,Orange,2015-01-01,2016-01-01
2,Orange,2016-05-31,2017-01-01
2,Orange,2017-01-01,2018-01-01
3,Banana,2015-01-01,2016-01-01
3,Banana,2016-01-01,2017-01-01
3,Blueberry,2017-01-01,2018-01-01
4,Mango,2015-01-01,2016-01-01
4,Kiwi,2016-09-15,2017-01-01
4,Mango,2017-01-01,2018-01-01

Want:

FruitID,FruitType,NewStartDate,NewEndDate
1,Apple,2015-01-01,2018-01-01
2,Orange,2015-01-01,2016-01-01
2,Orange,2016-05-31,2018-01-01
3,Banana,2015-01-01,2017-01-01
3,Blueberry,2017-01-01,2018-01-01
4,Mango,2015-01-01,2016-01-01
4,Kiwi,2016-09-15,2017-01-01
4,Mango,2017-01-01,2018-01-01

My current solution is below. It provides the results I'm looking for, but performance doesn't seem great for large datasets. Also, my impression is that you generally want to avoid iterating over individual rows of a dataframe when possible. Thank you very much for any assistance you can provide!

import pandas as pd

# Build the sample data. pd.DataFrame.from_items was removed in pandas 1.0,
# so a plain dict (which preserves column order) is used instead.
have = pd.DataFrame({
    'FruitID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'FruitType': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange',
                  'Banana', 'Banana', 'Blueberry', 'Mango', 'Kiwi', 'Mango'],
    'StartDate': pd.to_datetime(['2015-01-01', '2016-01-01', '2017-01-01', '2015-01-01', '2016-05-31', '2017-01-01',
                                 '2015-01-01', '2016-01-01', '2017-01-01', '2015-01-01', '2016-09-15', '2017-01-01']),
    'EndDate': pd.to_datetime(['2016-01-01', '2017-01-01', '2018-01-01', '2016-01-01', '2017-01-01', '2018-01-01',
                               '2016-01-01', '2017-01-01', '2018-01-01', '2016-01-01', '2017-01-01', '2018-01-01']),
})

# sort_values returns a new DataFrame, so the result must be assigned back.
have = have.sort_values(['FruitID', 'StartDate'])

rowlist = []
fruit_cur_row = None

for row in have.itertuples():
    if fruit_cur_row is None:
        fruit_cur_row = row._asdict()
        fruit_cur_row.update(NewStartDate=row.StartDate, NewEndDate=row.EndDate)

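    # A change in either the id or the attribute starts a new merged record.
    elif (fruit_cur_row.get('FruitID'), fruit_cur_row.get('FruitType')) != (row.FruitID, row.FruitType):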
        rowlist.append(fruit_cur_row)

        fruit_cur_row = row._asdict()
        fruit_cur_row.update(NewStartDate=row.StartDate, NewEndDate=row.EndDate)

    elif (row.StartDate <= fruit_cur_row.get('NewEndDate')):
        fruit_cur_row['NewEndDate'] = max(fruit_cur_row['NewEndDate'], row.EndDate)
    else:
        rowlist.append(fruit_cur_row)
        fruit_cur_row = row._asdict()
        fruit_cur_row.update(NewStartDate=row.StartDate, NewEndDate=row.EndDate)

rowlist.append(fruit_cur_row)
have_mrg = pd.DataFrame.from_dict(rowlist)
print(have_mrg[['FruitID', 'FruitType', 'NewStartDate', 'NewEndDate']])

Recommended answer

Use a nested groupby approach:

def merge_dates(grp):
    # Find contiguous date groups, and get the first/last start/end date for each group.
    dt_groups = (grp['StartDate'] != grp['EndDate'].shift()).cumsum()
    return grp.groupby(dt_groups).agg({'StartDate': 'first', 'EndDate': 'last'})

# Perform a groupby and apply the merge_dates function, followed by formatting.
# Here df is the input DataFrame (`have` in the question), sorted by FruitID and StartDate.
df = df.groupby(['FruitID', 'FruitType']).apply(merge_dates)
df = df.reset_index().drop('level_2', axis=1)

Note that this method assumes your dates are already sorted. If not, you'll need to use sort_values on your DataFrame first. This method may not work if you have nested date spans.
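For the nested or overlapping spans mentioned in the caveat above, one possible variant compares each start date against the running maximum of the earlier end dates instead of only the previous row's end date. This is a sketch under stated assumptions, not the answer's original method: the function name `merge_ranges`, the helper column `block`, and the `Pear` sample rows are all made up for illustration.

```python
import pandas as pd

def merge_ranges(df):
    """Merge touching, overlapping, or nested date ranges per (FruitID, FruitType)."""
    keys = ["FruitID", "FruitType"]
    # Sort so that, within each key group, rows are in chronological order.
    df = df.sort_values(keys + ["StartDate", "EndDate"]).reset_index(drop=True)
    # For each row, the latest EndDate among the earlier rows of its group
    # (NaT for the first row of a group).
    prev_max_end = df.groupby(keys)["EndDate"].transform(lambda s: s.cummax().shift())
    # A row starts a new block when it begins after everything seen so far.
    # Comparisons against NaT are False, so the first row of each group
    # always starts a new block.
    starts_new_block = ~(df["StartDate"] <= prev_max_end)
    df = df.assign(block=starts_new_block.cumsum())
    merged = (
        df.groupby(keys + ["block"], sort=False)
          .agg(StartDate=("StartDate", "min"), EndDate=("EndDate", "max"))
          .reset_index()
          .drop(columns="block")
    )
    return merged.sort_values(["FruitID", "StartDate"]).reset_index(drop=True)

# Hypothetical sample: the second and third spans are nested inside the first.
nested = pd.DataFrame({
    "FruitID": [5, 5, 5, 5],
    "FruitType": ["Pear"] * 4,
    "StartDate": pd.to_datetime(["2015-01-01", "2015-03-01", "2015-09-01", "2017-01-01"]),
    "EndDate": pd.to_datetime(["2016-01-01", "2015-06-01", "2016-01-01", "2018-01-01"]),
})
result = merge_ranges(nested)
print(result)
```

With this sample, the first three rows collapse into one span and the last row stays separate, because it starts after the latest end date seen so far.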

Resulting output:

   FruitID  FruitType   StartDate     EndDate
0        1      Apple  2015-01-01  2018-01-01
1        2     Orange  2015-01-01  2016-01-01
2        2     Orange  2016-05-31  2018-01-01
3        3     Banana  2015-01-01  2017-01-01
4        3  Blueberry  2017-01-01  2018-01-01
5        4       Kiwi  2016-09-15  2017-01-01
6        4      Mango  2015-01-01  2016-01-01
7        4      Mango  2017-01-01  2018-01-01
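In the output above, Kiwi is listed before both Mango rows for FruitID 4, because groupby sorts the result by its group keys (FruitID, then FruitType). If the chronological order from the question's desired output matters, sorting the merged result once more should restore it. The sketch below rebuilds only the FruitID 4 rows by hand so that it runs on its own. The name `ordered` is an assumption for illustration.

```python
import pandas as pd

# The FruitID 4 rows of the merged result, in the order printed above.
df = pd.DataFrame({
    "FruitID": [4, 4, 4],
    "FruitType": ["Kiwi", "Mango", "Mango"],
    "StartDate": pd.to_datetime(["2016-09-15", "2015-01-01", "2017-01-01"]),
    "EndDate": pd.to_datetime(["2017-01-01", "2016-01-01", "2018-01-01"]),
})

# Re-sort chronologically within each id and rebuild a clean 0..n-1 index.
ordered = df.sort_values(["FruitID", "StartDate"]).reset_index(drop=True)
print(ordered)
```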
