Merging dataframes based on date range

Problem description

I have two pandas dataframes: one (df1) with three columns (StartDate, EndDate, and ID) and a second (df2) with a Date column. I want to merge df1 and df2 based on df2.Date falling between df1.StartDate and df1.EndDate.

Each date range in df1 is unique and doesn't overlap with any other row in the dataframe.

Dates are in YYYY-MM-DD format.

Recommended answer

Just to provide an alternative approach using np.piecewise. Its performance is even faster than np.searchsorted.

import pandas as pd
import numpy as np

# data
# ====================================
df1 = pd.DataFrame({
    'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'),
    'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'),
    'ID': np.arange(1, 10, 1),
})

df2 = pd.DataFrame(dict(
    values=np.random.randn(50),
    date_time=pd.date_range('2010-01-01', periods=50, freq='D'),
))

df1.StartDate

Out[139]: 
0   2010-01-01
1   2010-01-06
2   2010-01-11
3   2010-01-16
4   2010-01-21
5   2010-01-26
6   2010-01-31
7   2010-02-05
8   2010-02-10
Name: StartDate, dtype: datetime64[ns]

df2.date_time

Out[140]: 
0    2010-01-01
1    2010-01-02
2    2010-01-03
3    2010-01-04
4    2010-01-05
5    2010-01-06
6    2010-01-07
7    2010-01-08
8    2010-01-09
9    2010-01-10
        ...    
40   2010-02-10
41   2010-02-11
42   2010-02-12
43   2010-02-13
44   2010-02-14
45   2010-02-15
46   2010-02-16
47   2010-02-17
48   2010-02-18
49   2010-02-19
Name: date_time, dtype: datetime64[ns]


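# One boolean mask per df1 row; dates that fall in no range keep the default 0.
dates = df2.date_time.values
conditions = [(dates >= start_date) & (dates <= end_date)
              for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)]
df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), conditions, df1.ID.values)

df2
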
Out[143]: 
    date_time  values  ID_matched
0  2010-01-01 -0.2240           1
1  2010-01-02 -0.4202           1
2  2010-01-03  0.9998           1
3  2010-01-04  0.4310           1
4  2010-01-05 -0.6509           0
5  2010-01-06 -1.4987           2
6  2010-01-07 -1.2306           2
7  2010-01-08  0.1940           2
8  2010-01-09 -0.9984           2
9  2010-01-10 -0.3676           0
..        ...     ...         ...
40 2010-02-10  0.5242           9
41 2010-02-11  0.3451           9
42 2010-02-12  0.7244           9
43 2010-02-13 -2.0404           9
44 2010-02-14 -1.0798           0
45 2010-02-15 -0.6934           0
46 2010-02-16 -2.3380           0
47 2010-02-17  1.6623           0
48 2010-02-18 -0.2754           0
49 2010-02-19 -0.7466           0

[50 rows x 3 columns]
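
The result above only attaches the matched ID to df2. To finish the merge the question asks for, one possible follow-up is a left join on that ID. This step is not part of the original answer; it is a minimal sketch that assumes df1.ID is unique and that 0 is never a real ID (it is the "no match" placeholder produced by np.piecewise). The name `merged` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same sample data as in the answer.
df1 = pd.DataFrame({
    'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'),
    'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'),
    'ID': np.arange(1, 10),
})
df2 = pd.DataFrame({
    'values': np.random.randn(50),
    'date_time': pd.date_range('2010-01-01', periods=50, freq='D'),
})

# Tag each df2 row with the ID of the range containing its date (0 = no range).
dates = df2['date_time'].values
conditions = [(dates >= start) & (dates <= end)
              for start, end in zip(df1['StartDate'].values, df1['EndDate'].values)]
matched = np.piecewise(np.zeros(len(df2)), conditions, df1['ID'].tolist())
df2['ID_matched'] = matched.astype(int)

# Assumed follow-up step: a left join keeps every df2 row; rows with
# ID_matched == 0 find no partner in df1 and get NaT/NaN in df1's columns.
merged = df2.merge(df1, how='left', left_on='ID_matched', right_on='ID')
```

A left join is used so that dates outside every range stay in the result; switching to `how='inner'` would drop them instead.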

%timeit df2['ID_matched'] = np.piecewise(np.zeros(len(df2)), [(df2.date_time.values >= start_date)&(df2.date_time.values <= end_date) for start_date, end_date in zip(df1.StartDate.values, df1.EndDate.values)], df1.ID.values)
1000 loops, best of 3: 466 µs per loop
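
For comparison, the np.searchsorted approach mentioned at the top of the answer is not shown in the original. The following is only a sketch of what it might look like, not the code that was actually timed. It assumes df1 is sorted by StartDate and that the ranges do not overlap, as the question states; the helper names `idx`, `safe_idx`, and `in_range` are made up for this example.

```python
import numpy as np
import pandas as pd

# Same sample data as in the answer.
df1 = pd.DataFrame({
    'StartDate': pd.date_range('2010-01-01', periods=9, freq='5D'),
    'EndDate': pd.date_range('2010-01-04', periods=9, freq='5D'),
    'ID': np.arange(1, 10),
})
df2 = pd.DataFrame({
    'values': np.random.randn(50),
    'date_time': pd.date_range('2010-01-01', periods=50, freq='D'),
})

starts = df1['StartDate'].values   # must be sorted ascending
ends = df1['EndDate'].values
ids = df1['ID'].values
dates = df2['date_time'].values

# Position of the last range whose start is <= the date (-1 if none).
idx = np.searchsorted(starts, dates, side='right') - 1
safe_idx = np.maximum(idx, 0)      # avoid indexing with -1

# The date matches only if it also falls on or before that range's end.
in_range = (idx >= 0) & (dates <= ends[safe_idx])
df2['ID_matched'] = np.where(in_range, ids[safe_idx], 0)
```

This variant does a binary search per date instead of building one mask per range, so how it compares in speed to np.piecewise will depend on the number of ranges and rows.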
