How to merge two dataframes based on the closest (or most recent) timestamp
Problem description
Suppose I have a dataframe df1 with columns 'A' and 'B'. 'A' is a column of timestamps (e.g. Unix time) and 'B' is a column of some value.
Suppose I also have a dataframe df2 with columns 'C' and 'D'. 'C' is also a Unix time column and 'D' is a column containing some other values.
I would like to fuzzy merge the dataframes with a join on the timestamp. However, if the timestamps don't match (which they most likely won't), I would like each row to merge on the closest entry in 'C' that comes before the timestamp in 'A'.
pd.merge does not support this, and I find myself converting away from dataframes using to_dict() and iterating to solve it. Is there a way to do this in pandas?
Recommended answer
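numpy.searchsorted() finds the appropriate index positions to merge on (see docs). It assigns each df2 row to the first df1 row whose 'A' is at or after its 'C', so a df1 row may match zero, one, or several df2 rows. Hopefully the below gets you closer to what you're looking for: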
from datetime import datetime, timedelta
from random import randrange
import numpy as np
import pandas as pd
start = datetime(2015, 12, 1)
df1 = pd.DataFrame({'A': [start + timedelta(minutes=randrange(60)) for _ in range(10)], 'B': [1] * 10}).sort_values('A').reset_index(drop=True)
df2 = pd.DataFrame({'C': [start + timedelta(minutes=randrange(60)) for _ in range(10)], 'D': [2] * 10}).sort_values('C').reset_index(drop=True)
# Label each df2 row with the position of the first df1 row whose 'A' is >= its 'C'
df2.index = np.searchsorted(df1.A.values, df2.C.values)
print(pd.merge(left=df1, right=df2, left_index=True, right_index=True, how='left'))
                    A  B                   C    D
0 2015-12-01 00:01:00  1                 NaT  NaN
1 2015-12-01 00:02:00  1 2015-12-01 00:02:00    2
2 2015-12-01 00:02:00  1                 NaT  NaN
3 2015-12-01 00:12:00  1 2015-12-01 00:05:00    2
4 2015-12-01 00:16:00  1 2015-12-01 00:14:00    2
4 2015-12-01 00:16:00  1 2015-12-01 00:14:00    2
5 2015-12-01 00:28:00  1 2015-12-01 00:22:00    2
6 2015-12-01 00:30:00  1                 NaT  NaN
7 2015-12-01 00:39:00  1 2015-12-01 00:31:00    2
7 2015-12-01 00:39:00  1 2015-12-01 00:39:00    2
8 2015-12-01 00:55:00  1 2015-12-01 00:40:00    2
8 2015-12-01 00:55:00  1 2015-12-01 00:46:00    2
8 2015-12-01 00:55:00  1 2015-12-01 00:54:00    2
9 2015-12-01 00:57:00  1                 NaT  NaN
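If exactly one match per df1 row is wanted (the most recent 'C' at or before each 'A'), recent pandas versions also offer pd.merge_asof, which is a different technique from the searchsorted approach above. The following is only a minimal sketch under some assumptions: the sample timestamps and the 'D' values (10, 20, 30) are made up for illustration, and both frames are assumed to be sorted on their time columns, which merge_asof requires.

```python
import pandas as pd

# Small, deterministic frames (illustrative data, not from the question).
# Both must already be sorted on their time keys for merge_asof.
df1 = pd.DataFrame({
    "A": pd.to_datetime(["2015-12-01 00:01", "2015-12-01 00:12", "2015-12-01 00:16"]),
    "B": [1, 1, 1],
})
df2 = pd.DataFrame({
    "C": pd.to_datetime(["2015-12-01 00:05", "2015-12-01 00:14", "2015-12-01 00:15"]),
    "D": [10, 20, 30],
})

# direction="backward" picks, for each 'A', the last row of df2 with C <= A.
# Rows of df1 with no earlier 'C' get NaT/NaN in the df2 columns.
merged = pd.merge_asof(df1, df2, left_on="A", right_on="C", direction="backward")
print(merged)
```

Here the 00:01 row has no earlier 'C' and stays unmatched, 00:12 pairs with 00:05, and 00:16 pairs with 00:15. If the nearest timestamp in either direction is acceptable, direction="nearest" can be used instead, and the tolerance argument limits how far apart a match may be.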