如何查找日期范围重叠的行? [英] How to find rows with overlapping date ranges?

查看：69 发布时间：2021/6/13 20:36:21 python pandas

本文介绍了如何查找日期范围重叠的行?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个包含如下数据的数据框(数据的微小子集):

I have a dataframe that contains data like below (tiny subset of data):

我试图找出一种方法，我可以创建一个新的数据框，其中包含具有相同值的所有行:carrier, flightnumber, departureAirport 和 arrivalAirport 但也有重叠的日期范围.

I'm trying to figure out a way where I can create a new dataframe that contains all rows that have the same values for : carrier, flightnumber, departureAirport and arrivalAirport but also have date ranges that overlap.

重叠我的意思是一行的 effectiveDate 介于 effectiveDate 和 discontinuedDate 之间，另一条记录具有相同的值我提到的专栏.

By overlap I mean the effectiveDate for one row falls between the effectiveDate and discontinuedDate for another record that has the same values for the other columns I mentioned.

所以在我上面的例子中，前两行将被视为一个例子(并且都应该包含在新的数据框中)，但第三行不是.

So in my above example, the first two rows would be considered an example of this (and should both be included in the new dataframe), but the third row is not.

我假设我想使用 groupby，但我并不完全清楚我将应用什么聚合函数.以下是我目前所拥有的:

I'm assuming I want to use groupby, but I'm not entirely clear on what aggregation function I would apply. Below is what I have so far:

df.groupby(['carrier','flightnumber','departureAirport','arrivalAirport'])['effectiveDate', 'discontinuedDate'].min()

但显然我需要应用一个确定重叠的函数而不是min().我将如何识别重叠而不是返回该组的最小值?

but obviously I need to apply a function that determines overlap instead of min(). How would I go about identifying overlap instead of returning the minimum values for this group?

更新:

carrier flightnumber  departureAirport  arrivalAirport  effectiveDate discontinuedDate
4U      9748          DUS               GVA             2017-05-09    2017-07-12
4U      9748          DUS               GVA             2017-05-14    2017-07-16
4U      9748          DUS               GVA             2017-07-18    2017-08-27
AG      1234          SFO               DFW             2017-03-09    2017-05-12
AG      1234          SFO               DFW             2017-03-14    2017-05-16

更新 2:

就输出而言，我希望有任何重叠且具有相同值的 carrier、flightnumber、departureAirport和 arrivalAirport 在新数据帧中返回.这些行不需要包含任何其他数据.所以对于上面的示例数据，像下面这样的数据帧将是我想要的输出:

As far as output goes I'd like to have any rows that overlap and have the same values for carrier, flightnumber, departureAirport and arrivalAirport returned in a new dataframe. There does not need to be any additional data included for these rows. So for the above example data, a dataframe like the one below would be my desired output:

carrier flightnumber  departureAirport  arrivalAirport  effectiveDate discontinuedDate
4U      9748          DUS               GVA             2017-05-09    2017-07-12
4U      9748          DUS               GVA             2017-05-14    2017-07-16
AG      1234          SFO               DFW             2017-03-09    2017-05-12
AG      1234          SFO               DFW             2017-03-14    2017-05-16

请注意，仅排除了一条记录(9748 的第三条记录) - 这是因为它的日期范围与同一航班的其他记录不重叠.

Notice that only one record has been excluded (the third for 9748) - this is because it's date range does not overlap with other records for the same flight.

推荐答案

高级概念

按所有日期排序，然后在完全重叠时优先effectiveDate.
累计求和到交替的和在排序之前初始化的负值.关键是当累积总和高于 1 时会发生重叠.当总和下降到 0 时，连续组结束.
取消排序并确定出现零的位置......这些是重叠组的结尾.
在这些断点上拆分数据帧索引，只采用拆分大小大于 1 的拆分.
连接传递的分割并使用 loc 获取切片数据帧.

Sort by all dates and then by prioritizing effectiveDate if there is exact overlap.
Cumulatively sum to alternating ones and negative ones that were initialized prior to sorting. The point is that an overlap happens when the cumulative sum is above 1. A contiguous group ends when the sum drops to 0.
Unsort the sorting and identify where zeros happen... these are the end of overlapping groups.
Split the dataframe index on these break points and only take the splits where the size of the split is greater than 1.
Concatenate the passing splits and use loc to get the sliced dataframe.

def overlaping_groups(df):
    n = len(df)
    cols = ['effectiveDate', 'discontinuedDate']
    v = np.column_stack([df[c].values for c in cols]).ravel()
    i = np.tile([1, -1], n)
    a = np.lexsort([-i, v])
    u = np.empty_like(a)
    u[a] = np.arange(a.size)
    e = np.flatnonzero(i[a].cumsum()[u][1::2] == 0)
    d = np.diff(np.append(-1, e))
    s = np.split(df.index.values, e[:-1] + 1)

    return df.loc[np.concatenate([g for j, g in enumerate(s) if d[j] > 1])]

gcols = ['carrier', 'flightnumber', 'departureAirport', 'arrivalAirport']
df.groupby(gcols, group_keys=False).apply(overlaping_groups)

  carrier  flightnumber departureAirport arrivalAirport effectiveDate discontinuedDate
0      4U          9748              DUS            GVA    2017-05-09       2017-07-12
1      4U          9748              DUS            GVA    2017-05-14       2017-07-16
3      AG          1234              SFO            DFW    2017-03-09       2017-05-12
4      AG          1234              SFO            DFW    2017-03-14       2017-05-16

这篇关于如何查找日期范围重叠的行?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何查找日期范围重叠的行? [英] How to find rows with overlapping date ranges?

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

如何查找日期范围重叠的行? [英] How to find rows with overlapping date ranges?

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭