在特定的ID列上合并两个DataFrame(数据集),但具有Date条件 [英] Merging two DataFrames (Datasets) on a specific ID column but with Date condition
问题描述
我有两个数据集:
-
其中一份包含过去10年颁发的房屋能源证书,其中包含该房屋的ID及其发布日期.一所房子可以续签更多的证书.
One contains house energy certificates issued the last 10 years with an ID for the house and the date it was issued. One house could have more certificates issued, as they can renew it.
另一个包含过去10年的所有房屋交易和ID(与第一个数据集中的ID相同)
The other contains all transactions of houses for the last 10 years and the ID (Which is the same id as in the first dataset)
然后我的问题是找到房屋出售当天的能源证明书价值.我可以合并房屋ID上的数据集,但不太确定要处理日期列.
My problem is then find the Energy certificate value of the house on the date it was being sold. I am able to merge the datasets on the house ID, but not quite sure to deal with the date column.
能源证书"的列为"DateIssued",而交易数据集的列为"OfficialDateSold".然后,条件是找到带有正确房屋ID的能源证书,然后找到最接近出售日期的日期,但不晚于日期.
The Energy Certificates has the column with the "DateIssued" and the Transaction data set has the column "OfficialDateSold". The conditions would then be to find the Energy certificate with the right House ID and then with the date closest to the sold date, but not after.
数据帧的片段:
Transactions:
address_id sold_date
0 1223632151 NaN
1 160073875 2013-09-24
2 160073875 2010-06-16
3 160073875 2009-08-05
4 160073875 2006-12-18
... ... ...
2792726 2147477357 2011-11-03
2792727 2147477357 2014-02-26
2792728 2147477579 2017-05-24
2792729 2147479054 2013-02-04
2792730 2147482539 1993-08-10
Energy Certificate
id certificate_number date_issued
0 1785963944 A2012-274656 27.11.2012 10:32:35
1 512265039 A2010-6435 30.06.2010 13:19:18
2 2003824679 A2014-459214 17.06.2014 11:00:47
3 1902877247 A2011-133593 14.10.2011 12:57:08
4 1620713314 A2009-266 25.12.2009 13:18:32
... ... ... ...
307846 753123775 A2019-1078357 30.11.2019 17:23:59
307847 1927124560 A2019-1078363 30.11.2019 20:44:22
307848 1122610963 A2019-1078371 30.11.2019 22:44:45
307849 28668673 A2019-1078373 30.11.2019 22:56:23
307850 1100393780 A2019-1078377 30.11.2019 23:38:42
想要输出
id certificate_number date_issued sold_date
id = address_id
date_issued <= sold_date
还要找到最接近sold_date的证书(售出前最新的证书) (我知道日期必须使用相同的格式)
But also to find the Certificate closest to the sold_date(the newest before sold) (I know the dates must be in the same format)
我正在将Python与Jupyter Notebook一起使用.
I am using Python with Jupyter Notebook.
推荐答案
我认为您需要 to_datetime
,并通过
I think you need merge_asof
, but first is necessary convert columns to datetimes
s by to_datetime
and remove rows with missing values in sold_date
by DataFrame.dropna
:
df1['sold_date'] = pd.to_datetime(df1['sold_date'])
df2['date_issued'] = pd.to_datetime(df2['date_issued'], dayfirst=True)
df1 = df1.dropna(subset=['sold_date'])
df = pd.merge_asof(df2.sort_values('date_issued'),
df1.sort_values('sold_date'),
left_on='date_issued',
right_on='sold_date',
left_by='id',
right_by='address_id')
这篇关于在特定的ID列上合并两个DataFrame(数据集),但具有Date条件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!