Pandas:按以下方式计算一组上的时间间隔交叉点 [英] Pandas: Count time interval intersections over a group by

查看:81
本文介绍了Pandas:按以下方式计算一组上的时间间隔交叉点的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下形式的数据框

import pandas as pd

Out[1]:
df = pd.DataFrame({'id':[1,2,3,4,5],
          'group':['A','A','A','B','B'],
          'start':['2012-08-19','2012-08-22','2013-08-19','2012-08-19','2013-08-19'],
          'end':['2012-08-28','2013-09-13','2013-08-19','2012-12-19','2014-08-19']})

     id group       start         end
0   1     A  2012-08-19  2012-08-28
1   2     A  2012-08-22  2013-09-13
2   3     A  2013-08-19  2013-08-21
3   4     B  2012-08-19  2012-12-19
4   5     B  2013-08-19  2014-08-19

对于我的数据框中的给定行,我想计算同一组中具有重叠时间间隔的项目数.

For given row in my dataframe I'd like to count the number of items in the same group that have an overlapping time interval.

例如,在 A 组中,id 2 的范围为 2012 年 8 月 22 日至 2013 年 9 月 13 日,因此 id 1(2012 年 8 月 19 日至 2012 年 8 月 28 日)和 id 3(2013 年 8 月 19 日至 2013 年 8 月 21 日)之间的重叠计数为 2.

For example in group A id 2 ranges from 22 August 2012 to 13 Sept 2013 and hence the overlap between id 1 (19 August 2012 to 28 August 2012) and also id 3 (19 August 2013 to 21 August 2013) for a count of 2.

相反,B组中的项目之间没有重叠

Conversely there is no overlap between the items in group B

所以对于我上面的示例数据框,我想生成类似

So for my example dataframe above i'd like to produce something like

Out[2]:
   id group       start         end  count
0   1     A  2012-08-19  2012-08-28      1
1   2     A  2012-08-22  2013-09-13      2
2   3     A  2013-08-19  2013-08-21      1
3   4     B  2012-08-19  2012-12-19      0
4   5     B  2013-08-19  2014-08-19      0

我可以蛮力"执行此操作,但我想知道是否有更有效的 Pandas 方法来完成此操作.

I could "brute-force" this but I'd like to know if there is a more efficient Pandas way of getting this done.

预先感谢您的帮助

推荐答案

所以,我想看看蛮力是如何公平的……如果速度很慢,我会用cythonize 这个逻辑.还不错,虽然组大小为 O(M^2),但如果有很多小组,那可能还不错.

So, I would see how brute force fairs... if it's slow I'd cythonize this logic. It's not so bad, as whilst O(M^2) in group size, if there's lots of small groups it might not be so bad.

In [11]: def interval_overlaps(a, b):
    ...:     return min(a["end"], b["end"]) - max(a["start"], b["start"]) > np.timedelta64(-1)


In [12]: def count_overlaps(df1):
    ...:     return sum(interval_overlaps(df1.iloc[i], df1.iloc[j]) for i in range(len(df1) - 1) for j in range(i, len(df1)) if i < j)

In [13]: df.groupby("group").apply(count_overlaps)
Out[13]:
group
A    2
B    0
dtype: int64

前者是对这个区间重叠函数的调整.

重新阅读时,count_overlaps 看起来像是每行,而不是每组,所以 agg 函数应该更像:

Upon re-reading it looks like the count_overlaps is per-row, rather than per-group, so the agg function should be more like:

In [21]: def count_overlaps(df1):
    ...:     return pd.Series([df1.apply(lambda x: interval_overlaps(x, df1.iloc[i]), axis=1).sum() - 1 for i in range(len(df1))], df1.index)

In [22]: df.groupby("group").apply(count_overlaps)
Out[22]:
group
A      0    1
       1    2
       2    1
B      3    0
       4    0
dtype: int64

In [22]: df["count"] = df.groupby("group").apply(count_overlaps).values

In [23]: df
Out[23]:
         end group  id      start  count
0 2012-08-28     A   1 2012-08-19      1
1 2013-09-13     A   2 2012-08-22      2
2 2013-08-19     A   3 2013-08-19      1
3 2012-12-19     B   4 2012-08-19      0
4 2014-08-19     B   5 2013-08-19      0

这篇关于Pandas:按以下方式计算一组上的时间间隔交叉点的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆