合并带有时间戳和间隔的数据帧 [英] merge dataframes with timestamps and intervals

查看:88
本文介绍了合并带有时间戳和间隔的数据帧的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据帧。

df1包含数字和时间戳。

df1contains number and timestamps. It is a very large set.

df1.head()
Out[292]: 
2016-08-31 08:09:00     1.0
2016-08-31 08:11:00     7.0
2016-08-31 08:14:00    90.0

df2包含间隔

d2.head()
Out[292]:        
   start                 stop                         C
2016-08-31 08:09:00     2016-08-31 08:12:00         'a'
2016-08-31 08:13:00     2016-08-31 08:20:00         'b'
2016-08-31 08:20:00     2016-08-31 08:45:00         'c'

我想向其中添加新列 C df1 ,这样 C 的值对应于 df2 包含索引 df1 的时间间隔。

I would like to add a new column C to df1 such that the value of C is corresponding to the value in df2 for the interval which contained the index of df1.

预期结果

 df1.head()
    Out[292]:                      C
    2016-08-31 08:09:00     1.0   'a'
    2016-08-31 08:11:00     7.0   'a'
    2016-08-31 08:14:00    90.0   'b'

到目前为止,我已经尝试过:

So far, I have tried:

 df1.loc[:,'C']=df1.index.map(lambda i:df2[np.logical_and(i>df2.starti<df2.stop)].C)

,但是效率极低并且崩溃在某些情况下,df2的间隔列表中缺少索引值。

but it is extremely inefficient and crashes for some cases where there are values of the index which are missing from the interval list in df2.

如何有效地做到这一点?

How to do this efficiently?

推荐答案

通过IntervalIndex /stable/generated/pandas.IntervalIndex.from_arrays.html rel = nofollow noreferrer> IntervalIndex.from_arrays 首先:

Create IntervalIndex by IntervalIndex.from_arrays first:

s = pd.IntervalIndex.from_arrays(df2['start'], df2['stop'], 'both')
print (s)
IntervalIndex([[2016-08-31 08:09:00, 2016-08-31 08:12:00], 
               [2016-08-31 08:13:00, 2016-08-31 08:20:00],
               [2016-08-31 08:20:00, 2016-08-31 08:45:00]],
              closed='both',
              dtype='interval[datetime64[ns]]')

然后 set_index 由新的 IntervalIndex 设置为由

Then set_index by new IntervalIndex set to new column by array created by values:

df1['C'] = df2.set_index(s).loc[df1.index, 'C'].values
print (df1)
                        A  C
2016-08-31 08:09:00   1.0  a
2016-08-31 08:11:00   7.0  a
2016-08-31 08:14:00  90.0  b

编辑:

s = pd.IntervalIndex.from_arrays(df2['start'].astype(np.int64), 
                                 df2['stop'].astype(np.int64), 'both')
print (s)
IntervalIndex([[1472630940000000000, 1472631120000000000], 
               [1472631180000000000, 1472631600000000000], 
               [1472631600000000000, 1472633100000000000]],
              closed='both',
              dtype='interval[int64]')

df1['C'] = df2.set_index(s).loc[df1.index.astype(np.int64), 'C'].values
print (df1)
                        A  C
2016-08-31 08:09:00   1.0  a
2016-08-31 08:11:00   7.0  a
2016-08-31 08:14:00  90.0  b

这篇关于合并带有时间戳和间隔的数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆