带有日期范围列的内部连接 ​​Pandas 数据框 [英] Inner join pandas dataframes with date range columns

查看:82
本文介绍了带有日期范围列的内部连接 ​​Pandas 数据框的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有 2 个带日期范围的数据框

I have 2 data frames with dateranges

>>>>df1 = pd.DataFrame({'ID': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
          'from_dt':[pd.to_datetime('2004-01-01 00:00:00'),
                     pd.to_datetime('2005-01-01 00:00:00'),
                     pd.to_datetime('2007-01-01 00:00:00'),
                     pd.to_datetime('2011-01-01 00:00:00'),
                     pd.to_datetime('2004-01-01 00:00:00'),
                     pd.to_datetime('2012-01-01 00:00:00'),
                     pd.to_datetime('2019-01-01 00:00:00'),
                     ],
         'thru_dt':[pd.to_datetime('2004-12-31 23:59:59'),
                    pd.to_datetime('2006-12-31 23:59:59'),
                    pd.to_datetime('2010-12-31 23:59:59'),
                    pd.to_datetime('2075-12-31 23:59:59'),
                    pd.to_datetime('2011-12-31 23:59:59'),
                    pd.to_datetime('2018-12-31 23:59:59'),
                    pd.to_datetime('2075-12-31 23:59:59'),
                    ],
        'val':[1,2,3,4,5,6,7]})
>>>>df1
  ID    from_dt             thru_dt  val
0  a 2004-01-01 2004-12-31 23:59:59    1
1  a 2005-01-01 2006-12-31 23:59:59    2
2  a 2007-01-01 2010-12-31 23:59:59    3
3  a 2011-01-01 2075-12-31 23:59:59    4
4  b 2004-01-01 2011-12-31 23:59:59    5
5  b 2012-01-01 2018-12-31 23:59:59    6
6  b 2019-01-01 2075-12-31 23:59:59    7

第一个包含 ID 到值的映射

The first one contains a map of ID to Value

>>>>df2 = pd.DataFrame({'ID':['a', 'a', 'b'], 'ID2':['A1', 'A2', 'B1'],
                'from_dt':[pd.to_datetime('2003-01-01 00:00:00'),
                        pd.to_datetime('2010-01-01 00:00:00'),
                        pd.to_datetime('2005-01-01 00:00:00'),
                        ],
                'thru_dt':[pd.to_datetime('2009-12-31 23:59:59'),
                           pd.to_datetime('2075-12-31 23:59:59'),
                           pd.to_datetime('2075-12-31 23:59:59'),
                           ]
                })
>>>>df2
  ID ID2    from_dt             thru_dt
0  a  A1 2003-01-01 2009-12-31 23:59:59
1  a  A2 2010-01-01 2075-12-31 23:59:59
2  b  B1 2005-01-01 2075-12-31 23:59:59

第二个数据帧的一个 ID 映射到另一个.

The second dataframe has one ID mapped to the other.

我想用一些内部连接逻辑将它们连接起来,以便我有一个具有以下输出的平面文件数据框

I'd like to join them with some inner join logic so that I have a flat file dataframe which has the following output

>>>>df3 = some_function(df1, df2, end_date=pd.to_datetime('2020-12-31'))

>>>>df3
  ID2       Date  val
0  A1 2004-01-01    1
1  A1 2004-01-02    1
.
.
x  A1 2004-12-31    1
x  A1 2005-01-01    2
x  A1 2005-01-02    2
.
.
x  A1 2009-12-31    3
x  A2 2010-01-01    3
.
.
x  A2 2020-12-31    4
x  B1 2005-01-01    5
.
.
x  B1 2020-12-31    7

我确信我可以循环并以某种低效的方式执行此操作.我很想了解可以处理此类任务的任何现有工具和库.

I'm sure I can loop over and do this in some inefficient manner. I'd love to learn about any existing tools and lib that can handle such tasks.

谢谢!

推荐答案

我认为您可以使用列表理解来展平日期范围,然后按结束日期时间过滤:

I think you can use list comprehension for flatten date ranges and then filter by end datetime:

s1 = pd.concat([pd.Series(r.Index,pd.date_range(r.from_dt, r.thru_dt)) 
                     for r in df1.itertuples()])
df11 = df1.drop(['from_dt','thru_dt'], 1).join(pd.DataFrame({'Date':s1.index}, index=s1))
df11 = df11[df11['Date'].lt('2020-12-31')]
print (df11)

s2 = pd.concat([pd.Series(r.Index,pd.date_range(r.from_dt, r.thru_dt)) 
                     for r in df2.itertuples()])
df22 = df2.drop(['from_dt','thru_dt'], 1).join(pd.DataFrame({'Date':s2.index}, index=s2))
df22 = df22[df22['Date'].lt('2020-12-31')]
print (df22)

这篇关于带有日期范围列的内部连接 ​​Pandas 数据框的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆