pandas.merge:匹配最近的时间戳> =一系列时间戳 [英] pandas.merge: match the nearest time stamp >= the series of timestamps

查看:178
本文介绍了pandas.merge:匹配最近的时间戳> =一系列时间戳的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据帧,两个数据帧均包含不规则间隔的毫秒分辨率时间戳记列.我的目标是匹配行,以便对于每个匹配的行,1)第一个时间戳始终小于或等于第二个时间戳,以及2)对于满足1)的所有时间戳对,匹配的时间戳都是最接近的.

I have two dataframes, both of which contain an irregularly spaced, millisecond resolution timestamp column. My goal here is to match up the rows so that for each matched row, 1) the first time stamp is always smaller or equal to the second timestamp, and 2) the matched timestamps are the closest for all pairs of timestamps satisfying 1).

pandas.merge有什么办法吗?

Is there any way to do this with pandas.merge?

推荐答案

merge()无法执行这种连接,但是可以使用searchsorted():

merge() can't do this kind of join, but you can use searchsorted():

创建一些随机时间戳:t1t2,按升序排列:

Create some random timestamps: t1, t2, there are in ascending order:

import pandas as pd
import numpy as np
np.random.seed(0)

base = np.array(["2013-01-01 00:00:00"], "datetime64[ns]")

a = (np.random.rand(30)*1000000*1000).astype(np.int64)*1000000
t1 = base + a
t1.sort()

b = (np.random.rand(10)*1000000*1000).astype(np.int64)*1000000
t2 = base + b
t2.sort()

调用searchsorted()t1中为t2中的每个值查找索引:

call searchsorted() to find index in t1 for every value in t2:

idx = np.searchsorted(t1, t2) - 1
mask = idx >= 0

df = pd.DataFrame({"t1":t1[idx][mask], "t2":t2[mask]})

这是输出:

                         t1                         t2
0 2013-01-02 06:49:13.287000 2013-01-03 16:29:15.612000
1 2013-01-05 16:33:07.211000 2013-01-05 21:42:30.332000
2 2013-01-07 04:47:24.561000 2013-01-07 04:53:53.948000
3 2013-01-07 14:26:03.376000 2013-01-07 17:01:35.722000
4 2013-01-07 14:26:03.376000 2013-01-07 18:22:13.996000
5 2013-01-07 14:26:03.376000 2013-01-07 18:33:55.497000
6 2013-01-08 02:24:54.113000 2013-01-08 12:23:40.299000
7 2013-01-08 21:39:49.366000 2013-01-09 14:03:53.689000
8 2013-01-11 08:06:36.638000 2013-01-11 13:09:08.078000

要通过图形查看此结果:

To view this result by graph:

import pylab as pl
pl.figure(figsize=(18, 4))
pl.vlines(pd.Series(t1), 0, 1, colors="g", lw=1)
pl.vlines(df.t1, 0.3, 0.7, colors="r", lw=2)
pl.vlines(df.t2, 0.3, 0.7, colors="b", lw=2)
pl.margins(0.02)

输出:

绿线是t1,蓝线是t2,每个t2都从t1中选择红线.

The green lines are t1, blue lines are t2, red lines are selected from t1 for every t2.

这篇关于pandas.merge:匹配最近的时间戳> =一系列时间戳的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆