如何查找日期范围重叠的行? [英] How to find rows with overlapping date ranges?

查看:69
本文介绍了如何查找日期范围重叠的行?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含如下数据的数据框(数据的微小子集):

I have a dataframe that contains data like below (tiny subset of data):

我试图找出一种方法,我可以创建一个新的数据框,其中包含具有相同值的所有行:carrier, flightnumber, departureAirportarrivalAirport 但也有重叠的日期范围.

I'm trying to figure out a way where I can create a new dataframe that contains all rows that have the same values for : carrier, flightnumber, departureAirport and arrivalAirport but also have date ranges that overlap.

重叠我的意思是一行的 effectiveDate 介于 effectiveDatediscontinuedDate 之间,另一条记录具有相同的值我提到的专栏.

By overlap I mean the effectiveDate for one row falls between the effectiveDate and discontinuedDate for another record that has the same values for the other columns I mentioned.

所以在我上面的例子中,前两行将被视为一个例子(并且都应该包含在新的数据框中),但第三行不是.

So in my above example, the first two rows would be considered an example of this (and should both be included in the new dataframe), but the third row is not.

我假设我想使用 groupby,但我并不完全清楚我将应用什么聚合函数.以下是我目前所拥有的:

I'm assuming I want to use groupby, but I'm not entirely clear on what aggregation function I would apply. Below is what I have so far:

df.groupby(['carrier','flightnumber','departureAirport','arrivalAirport'])['effectiveDate', 'discontinuedDate'].min()

但显然我需要应用一个确定重叠的函数而不是min().我将如何识别重叠而不是返回该组的最小值?

but obviously I need to apply a function that determines overlap instead of min(). How would I go about identifying overlap instead of returning the minimum values for this group?

更新:

carrier flightnumber  departureAirport  arrivalAirport  effectiveDate discontinuedDate
4U      9748          DUS               GVA             2017-05-09    2017-07-12
4U      9748          DUS               GVA             2017-05-14    2017-07-16
4U      9748          DUS               GVA             2017-07-18    2017-08-27
AG      1234          SFO               DFW             2017-03-09    2017-05-12
AG      1234          SFO               DFW             2017-03-14    2017-05-16

更新 2:

就输出而言,我希望有任何重叠且具有相同值的 carrierflightnumberdepartureAirportarrivalAirport 在新数据帧中返回.这些行不需要包含任何其他数据.所以对于上面的示例数据,像下面这样的数据帧将是我想要的输出:

As far as output goes I'd like to have any rows that overlap and have the same values for carrier, flightnumber, departureAirport and arrivalAirport returned in a new dataframe. There does not need to be any additional data included for these rows. So for the above example data, a dataframe like the one below would be my desired output:

carrier flightnumber  departureAirport  arrivalAirport  effectiveDate discontinuedDate
4U      9748          DUS               GVA             2017-05-09    2017-07-12
4U      9748          DUS               GVA             2017-05-14    2017-07-16
AG      1234          SFO               DFW             2017-03-09    2017-05-12
AG      1234          SFO               DFW             2017-03-14    2017-05-16

请注意,仅排除了一条记录(9748 的第三条记录) - 这是因为它的日期范围与同一航班的其他记录不重叠.

Notice that only one record has been excluded (the third for 9748) - this is because it's date range does not overlap with other records for the same flight.

推荐答案

高级概念

  • 按所有日期排序,然后在完全重叠时优先effectiveDate.
  • 累计求和到交替的和在排序之前初始化的负值.关键是当累积总和高于 1 时会发生重叠.当总和下降到 0 时,连续组结束.
  • 取消排序并确定出现零的位置......这些是重叠组的结尾.
  • 在这些断点上拆分数据帧索引,只采用拆分大小大于 1 的拆分.
  • 连接传递的分割并使用 loc 获取切片数据帧.
  • Sort by all dates and then by prioritizing effectiveDate if there is exact overlap.
  • Cumulatively sum to alternating ones and negative ones that were initialized prior to sorting. The point is that an overlap happens when the cumulative sum is above 1. A contiguous group ends when the sum drops to 0.
  • Unsort the sorting and identify where zeros happen... these are the end of overlapping groups.
  • Split the dataframe index on these break points and only take the splits where the size of the split is greater than 1.
  • Concatenate the passing splits and use loc to get the sliced dataframe.
def overlaping_groups(df):
    n = len(df)
    cols = ['effectiveDate', 'discontinuedDate']
    v = np.column_stack([df[c].values for c in cols]).ravel()
    i = np.tile([1, -1], n)
    a = np.lexsort([-i, v])
    u = np.empty_like(a)
    u[a] = np.arange(a.size)
    e = np.flatnonzero(i[a].cumsum()[u][1::2] == 0)
    d = np.diff(np.append(-1, e))
    s = np.split(df.index.values, e[:-1] + 1)

    return df.loc[np.concatenate([g for j, g in enumerate(s) if d[j] > 1])]

gcols = ['carrier', 'flightnumber', 'departureAirport', 'arrivalAirport']
df.groupby(gcols, group_keys=False).apply(overlaping_groups)

  carrier  flightnumber departureAirport arrivalAirport effectiveDate discontinuedDate
0      4U          9748              DUS            GVA    2017-05-09       2017-07-12
1      4U          9748              DUS            GVA    2017-05-14       2017-07-16
3      AG          1234              SFO            DFW    2017-03-09       2017-05-12
4      AG          1234              SFO            DFW    2017-03-14       2017-05-16

这篇关于如何查找日期范围重叠的行?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆