pandas merge dataframes by closest time

Problem description

I have two dataframes, logs and failures, and I would like to merge them so that logs gains a column holding the closest date found in failures.

The code to generate logs, failures, and the desired output is below:

import pandas as pd

# The timestamps are day-first (dd/mm/yyyy), so parse them with dayfirst=True.
log_times = ['23/10/2015 10:20:54', '22/10/2015 09:51:32', '21/10/2015 06:51:32',
             '28/10/2015 16:59:32', '25/10/2015 04:41:32', '24/10/2015 11:50:11']
logs = pd.DataFrame({'date-time': log_times, 'var1': [0, 1, 3, 1, 2, 4]})
logs['date-time'] = pd.to_datetime(logs['date-time'], dayfirst=True)

failures = pd.DataFrame({'date': ['23/10/2015 00:00:00', '22/10/2015 00:00:00', '21/10/2015 00:00:00'],
                         'failure': [1, 1, 1]})
failures['date'] = pd.to_datetime(failures['date'], dayfirst=True)

# Desired result: logs plus the failure date closest to each log timestamp.
output = pd.DataFrame({'date-time': log_times, 'var1': [0, 1, 3, 1, 2, 4],
                       'closest_failure': ['23/10/2015 00:00:00', '22/10/2015 00:00:00', '21/10/2015 00:00:00',
                                           '23/10/2015 00:00:00', '23/10/2015 00:00:00', '23/10/2015 00:00:00']})
output['date-time'] = pd.to_datetime(output['date-time'], dayfirst=True)

Any ideas? The real dataset is very large, so efficiency is also a concern.

Solution

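You can reindex with method="nearest". There may be a neater way, but it works to build a Series that has the failure dates as both its index and its values, then reindex it on the log timestamps. Note that reindex with a method requires the index to be monotonically increasing or decreasing, which holds for the failure dates here: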

In [11]: failures_dt = pd.Series(failures["date"].values, failures["date"])

In [12]: failures_dt.reindex(logs["date-time"], method="nearest")
Out[12]:
date-time
2015-10-23 10:20:54   2015-10-23
2015-10-22 09:51:32   2015-10-22
2015-10-21 06:51:32   2015-10-21
2015-10-28 16:59:32   2015-10-23
2015-10-25 04:41:32   2015-10-23
2015-10-24 11:50:11   2015-10-23
dtype: datetime64[ns]

In [13]: logs["nearest"] = failures_dt.reindex(logs["date-time"], method="nearest").values

In [14]: logs
Out[14]:
            date-time  var1    nearest
0 2015-10-23 10:20:54     0 2015-10-23
1 2015-10-22 09:51:32     1 2015-10-22
2 2015-10-21 06:51:32     3 2015-10-21
3 2015-10-28 16:59:32     1 2015-10-23
4 2015-10-25 04:41:32     2 2015-10-23
5 2015-10-24 11:50:11     4 2015-10-23
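
Since efficiency is a concern, another option that may be worth trying is pd.merge_asof with direction="nearest". This is not part of the answer above, only a minimal sketch under some assumptions: both frames are sorted by their key columns before the merge (merge_asof requires this), the key columns contain no missing values, and the names left, right, merged and result are made up for illustration.

```python
import pandas as pd

# Same sample data as in the question, parsed as day-first timestamps.
logs = pd.DataFrame({
    "date-time": pd.to_datetime(
        ["23/10/2015 10:20:54", "22/10/2015 09:51:32", "21/10/2015 06:51:32",
         "28/10/2015 16:59:32", "25/10/2015 04:41:32", "24/10/2015 11:50:11"],
        dayfirst=True),
    "var1": [0, 1, 3, 1, 2, 4],
})
failures = pd.DataFrame({
    "date": pd.to_datetime(
        ["23/10/2015 00:00:00", "22/10/2015 00:00:00", "21/10/2015 00:00:00"],
        dayfirst=True),
    "failure": [1, 1, 1],
})

# merge_asof needs both sides sorted by their key columns.
# reset_index() keeps the original row labels in an "index" column
# so the original order of logs can be restored afterwards.
left = logs.reset_index().sort_values("date-time")
right = (failures[["date"]]
         .rename(columns={"date": "closest_failure"})
         .sort_values("closest_failure"))

# For each log timestamp, pick the failure date with the smallest time difference.
merged = pd.merge_asof(left, right,
                       left_on="date-time", right_on="closest_failure",
                       direction="nearest")

# Restore the original row order of logs.
result = merged.sort_values("index").set_index("index")
result.index.name = None
print(result)
```

The sort and re-sort steps are only there to satisfy the ordering requirement while keeping logs in its original order; if the logs are already sorted by time they can be dropped. Whether this is faster than the reindex approach on the real dataset would need to be measured.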
