在特定的ID列上合并两个DataFrame(数据集),但具有Date条件 [英] Merging two DataFrames (Datasets) on a specific ID column but with Date condition

查看:161
本文介绍了在特定的ID列上合并两个DataFrame(数据集),但具有Date条件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据集:

  1. 其中一份包含过去10年颁发的房屋能源证书,其中包含该房屋的ID及其发布日期.一所房子可以续签更多的证书.

  1. One contains house energy certificates issued the last 10 years with an ID for the house and the date it was issued. One house could have more certificates issued, as they can renew it.

另一个包含过去10年的所有房屋交易和ID(与第一个数据集中的ID相同)

The other contains all transactions of houses for the last 10 years and the ID (Which is the same id as in the first dataset)

然后我的问题是找到房屋出售当天的能源证明书价值.我可以合并房屋ID上的数据集,但不太确定要处理日期列.

My problem is then find the Energy certificate value of the house on the date it was being sold. I am able to merge the datasets on the house ID, but not quite sure to deal with the date column.

能源证书"的列为"DateIssued",而交易数据集的列为"OfficialDateSold".然后,条件是找到带有正确房屋ID的能源证书,然后找到最接近出售日期的日期,但不晚于日期.

The Energy Certificates has the column with the "DateIssued" and the Transaction data set has the column "OfficialDateSold". The conditions would then be to find the Energy certificate with the right House ID and then with the date closest to the sold date, but not after.

数据帧的片段:

Transactions:

         address_id sold_date
0        1223632151           NaN
1         160073875    2013-09-24
2         160073875    2010-06-16
3         160073875    2009-08-05
4         160073875    2006-12-18
...             ...           ...
2792726  2147477357    2011-11-03
2792727  2147477357    2014-02-26
2792728  2147477579    2017-05-24
2792729  2147479054    2013-02-04
2792730  2147482539    1993-08-10

Energy Certificate

                id   certificate_number      date_issued
0       1785963944   A2012-274656  27.11.2012 10:32:35
1        512265039     A2010-6435  30.06.2010 13:19:18
2       2003824679   A2014-459214  17.06.2014 11:00:47
3       1902877247   A2011-133593  14.10.2011 12:57:08
4       1620713314      A2009-266  25.12.2009 13:18:32
...            ...            ...                  ...
307846   753123775  A2019-1078357  30.11.2019 17:23:59
307847  1927124560  A2019-1078363  30.11.2019 20:44:22
307848  1122610963  A2019-1078371  30.11.2019 22:44:45
307849    28668673  A2019-1078373  30.11.2019 22:56:23
307850  1100393780  A2019-1078377  30.11.2019 23:38:42

想要输出

 id                certificate_number      date_issued    sold_date
 id = address_id

 date_issued <= sold_date 

还要找到最接近sold_date的证书(售出前最新的证书) (我知道日期必须使用相同的格式)

But also to find the Certificate closest to the sold_date(the newest before sold) (I know the dates must be in the same format)

我正在将Python与Jupyter Notebook一起使用.

I am using Python with Jupyter Notebook.

推荐答案

我认为您需要

I think you need merge_asof, but first is necessary convert columns to datetimess by to_datetime and remove rows with missing values in sold_date by DataFrame.dropna:

df1['sold_date'] = pd.to_datetime(df1['sold_date'])
df2['date_issued'] = pd.to_datetime(df2['date_issued'], dayfirst=True)
df1 = df1.dropna(subset=['sold_date'])


df = pd.merge_asof(df2.sort_values('date_issued'), 
                   df1.sort_values('sold_date'), 
                   left_on='date_issued',
                   right_on='sold_date',
                   left_by='id',
                   right_by='address_id')

这篇关于在特定的ID列上合并两个DataFrame(数据集),但具有Date条件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆