pandas.merge:匹配最近的时间戳> =一系列时间戳 [英] pandas.merge: match the nearest time stamp >= the series of timestamps
问题描述
我有两个数据帧,两个数据帧均包含不规则间隔的毫秒分辨率时间戳记列.我的目标是匹配行,以便对于每个匹配的行,1)第一个时间戳始终小于或等于第二个时间戳,以及2)对于满足1)的所有时间戳对,匹配的时间戳都是最接近的.
I have two dataframes, both of which contain an irregularly spaced, millisecond resolution timestamp column. My goal here is to match up the rows so that for each matched row, 1) the first time stamp is always smaller or equal to the second timestamp, and 2) the matched timestamps are the closest for all pairs of timestamps satisfying 1).
pandas.merge有什么办法吗?
Is there any way to do this with pandas.merge?
推荐答案
merge()
无法执行这种连接,但是可以使用searchsorted()
:
merge()
can't do this kind of join, but you can use searchsorted()
:
创建一些随机时间戳:t1
,t2
,按升序排列:
Create some random timestamps: t1
, t2
, there are in ascending order:
import pandas as pd
import numpy as np
np.random.seed(0)
base = np.array(["2013-01-01 00:00:00"], "datetime64[ns]")
a = (np.random.rand(30)*1000000*1000).astype(np.int64)*1000000
t1 = base + a
t1.sort()
b = (np.random.rand(10)*1000000*1000).astype(np.int64)*1000000
t2 = base + b
t2.sort()
调用searchsorted()
在t1
中为t2
中的每个值查找索引:
call searchsorted()
to find index in t1
for every value in t2
:
idx = np.searchsorted(t1, t2) - 1
mask = idx >= 0
df = pd.DataFrame({"t1":t1[idx][mask], "t2":t2[mask]})
这是输出:
t1 t2
0 2013-01-02 06:49:13.287000 2013-01-03 16:29:15.612000
1 2013-01-05 16:33:07.211000 2013-01-05 21:42:30.332000
2 2013-01-07 04:47:24.561000 2013-01-07 04:53:53.948000
3 2013-01-07 14:26:03.376000 2013-01-07 17:01:35.722000
4 2013-01-07 14:26:03.376000 2013-01-07 18:22:13.996000
5 2013-01-07 14:26:03.376000 2013-01-07 18:33:55.497000
6 2013-01-08 02:24:54.113000 2013-01-08 12:23:40.299000
7 2013-01-08 21:39:49.366000 2013-01-09 14:03:53.689000
8 2013-01-11 08:06:36.638000 2013-01-11 13:09:08.078000
要通过图形查看此结果:
To view this result by graph:
import pylab as pl
pl.figure(figsize=(18, 4))
pl.vlines(pd.Series(t1), 0, 1, colors="g", lw=1)
pl.vlines(df.t1, 0.3, 0.7, colors="r", lw=2)
pl.vlines(df.t2, 0.3, 0.7, colors="b", lw=2)
pl.margins(0.02)
输出:
绿线是t1
,蓝线是t2
,每个t2
都从t1
中选择红线.
The green lines are t1
, blue lines are t2
, red lines are selected from t1
for every t2
.
这篇关于pandas.merge:匹配最近的时间戳> =一系列时间戳的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!