Merge pandas DataFrames based on irregular time intervals

Problem description

I'm wondering how I can speed up a merge of two dataframes. One of the dataframes has time stamped data points (value col).

import pandas as pd
import numpy as np

data = pd.DataFrame({'time':np.sort(np.random.uniform(0,100,size=50)),
                     'value':np.random.uniform(-1,1,size=50)})

The other has time interval information (start_time, end_time, and associated interval_id).

intervals = pd.DataFrame({'interval_id':np.arange(9),
                          'start_time':np.random.uniform(0,5,size=9) + np.arange(0,90,10),    
                          'end_time':np.random.uniform(5,10,size=9) + np.arange(0,90,10)})

I'd like to merge these two dataframes more efficiently than the for loop below:

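data['interval_id'] = np.nan
for index, ser in intervals.iterrows():
    in_interval = (data['time'] >= ser['start_time']) & \
                  (data['time'] <= ser['end_time'])
    # .loc avoids chained assignment, which may not modify data itself
    data.loc[in_interval, 'interval_id'] = ser['interval_id']

# sort_values replaces DataFrame.sort, which newer pandas versions no longer provide
result = data.merge(intervals, how='outer').sort_values('time').reset_index(drop=True)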

I keep imagining I'll be able to use pandas time series functionality, like a date range or TimeGrouper, but I have yet to figure out anything more pythonic (pandas-y?) than the above.

Example result:

     time      value     interval_id  start_time   end_time
0    0.575976  0.022727          NaN         NaN        NaN
1    4.607545  0.222568            0    3.618715   8.294847
2    5.179350  0.438052            0    3.618715   8.294847
3   11.069956  0.641269            1   10.301728  19.870283
4   12.387854  0.344192            1   10.301728  19.870283
5   18.889691  0.582946            1   10.301728  19.870283
6   20.850469 -0.027436          NaN         NaN        NaN
7   23.199618  0.731316            2   21.488868  28.968338
8   26.631284  0.570647            2   21.488868  28.968338
9   26.996397  0.597035            2   21.488868  28.968338
10  28.601867 -0.131712            2   21.488868  28.968338
11  28.660986  0.710856            2   21.488868  28.968338
12  28.875395 -0.355208            2   21.488868  28.968338
13  28.959320 -0.430759            2   21.488868  28.968338
14  29.702800 -0.554742          NaN         NaN        NaN

Any suggestions from time series-savvy people out there would be greatly appreciated.


Update, after Jeff's answer:

The main problem is that interval_id has no relation to any regular time interval (e.g., intervals are not always approximately 10 seconds). One interval could be 10 seconds, the next could be 2 seconds, and the next could be 100 seconds, so I can't use any regular rounding scheme as Jeff proposed. Unfortunately, my minimal example above does not make that clear.

Solution

You could use np.searchsorted to find, for each value in data['time'], the position where it would be inserted among the sorted intervals['start_time'] values. Then call np.searchsorted again to find the position where each value would be inserted among intervals['end_time']. Note that using np.searchsorted relies on intervals['start_time'] and intervals['end_time'] being in sorted order.

For each position in the two resulting arrays, the start index (minus one) and the end index are equal exactly when data['time'] falls between intervals['start_time'] and intervals['end_time'] of the same interval. Note that this relies on the intervals being disjoint.
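To make the index comparison concrete, here is a minimal sketch on hand-picked numbers. The arrays starts, ends, and times are made up for illustration and are not taken from the question's data. The sketch also uses side='right' for the start lookup, which is an assumption on my part: with the default side='left', a time exactly equal to a start_time would count as outside the interval, whereas the loop in the question treats both endpoints as inclusive.

```python
import numpy as np

# Hypothetical, hand-picked disjoint intervals: [1, 4], [10, 12], [20, 30]
starts = np.array([1.0, 10.0, 20.0])
ends = np.array([4.0, 12.0, 30.0])
times = np.array([0.5, 1.0, 3.0, 5.0, 11.0, 12.0, 35.0])

# Position of the last interval whose start is <= t
# (side='right' keeps t == start inside the interval)
start_idx = np.searchsorted(starts, times, side='right') - 1
# Position of the first interval whose end is >= t
end_idx = np.searchsorted(ends, times, side='left')

# Equal positions mean t lies inside that interval; -1 marks "no interval"
mask = start_idx == end_idx
interval_pos = np.where(mask, start_idx, -1)
print(interval_pos)  # [-1  0  0 -1  1  1 -1]
```

Times 0.5, 5.0, and 35.0 fall before, between, and after the intervals, so they get no match; 1.0 and 12.0 sit exactly on an endpoint and are matched because both sides are inclusive here.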

Using searchsorted in this way is about 5 times faster than using the for-loop:

import pandas as pd
import numpy as np

np.random.seed(1)
data = pd.DataFrame({'time':np.sort(np.random.uniform(0,100,size=50)),
                     'value':np.random.uniform(-1,1,size=50)})

intervals = pd.DataFrame(
    {'interval_id':np.arange(9),
     'start_time':np.random.uniform(0,5,size=9) + np.arange(0,90,10),    
     'end_time':np.random.uniform(5,10,size=9) + np.arange(0,90,10)})

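def using_loop():
    result = data.copy()
    result['interval_id'] = np.nan
    for index, ser in intervals.iterrows():
        in_interval = (result['time'] >= ser['start_time']) & \
                      (result['time'] <= ser['end_time'])
        # .loc avoids chained assignment
        result.loc[in_interval, 'interval_id'] = ser['interval_id']

    # sort_values replaces DataFrame.sort, which newer pandas versions no longer provide
    return result.merge(intervals, how='outer').sort_values('time').reset_index(drop=True)

def using_searchsorted():
    times = data['time'].values
    # position of the last interval that starts before each time
    start_idx = np.searchsorted(intervals['start_time'].values, times) - 1
    # position of the first interval that ends at or after each time
    end_idx = np.searchsorted(intervals['end_time'].values, times)
    # equal positions mean the time falls inside that interval
    mask = (start_idx == end_idx)
    result = data.copy()
    result['interval_id'] = result['start_time'] = result['end_time'] = np.nan
    # .loc replaces the .ix indexer; look up interval_id by position rather than
    # assuming it equals the position
    result.loc[mask, 'interval_id'] = intervals['interval_id'].values[start_idx[mask]]
    result.loc[mask, 'start_time'] = intervals['start_time'].values[start_idx[mask]]
    result.loc[mask, 'end_time'] = intervals['end_time'].values[end_idx[mask]]
    return result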


In [254]: %timeit using_loop()
100 loops, best of 3: 7.74 ms per loop

In [255]: %timeit using_searchsorted()
1000 loops, best of 3: 1.56 ms per loop

In [256]: 7.74/1.56
Out[256]: 4.961538461538462
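
As a possible alternative on pandas versions that provide pd.IntervalIndex, the lookup can be expressed with an interval index instead of two searchsorted calls. This is a sketch, not part of the original answer and not timed against it: the function name using_interval_index is made up for illustration, and it assumes, like the searchsorted approach, that the intervals do not overlap.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
data = pd.DataFrame({'time': np.sort(np.random.uniform(0, 100, size=50)),
                     'value': np.random.uniform(-1, 1, size=50)})

intervals = pd.DataFrame(
    {'interval_id': np.arange(9),
     'start_time': np.random.uniform(0, 5, size=9) + np.arange(0, 90, 10),
     'end_time': np.random.uniform(5, 10, size=9) + np.arange(0, 90, 10)})

def using_interval_index(data, intervals):
    # Hypothetical helper: closed='both' matches the inclusive comparisons of the loop
    idx = pd.IntervalIndex.from_arrays(intervals['start_time'],
                                       intervals['end_time'],
                                       closed='both')
    # Row position in `intervals` for each time, or -1 where no interval contains it
    pos = idx.get_indexer(data['time'].to_numpy())
    matched = pos >= 0
    result = data.copy()
    for col in ['interval_id', 'start_time', 'end_time']:
        result[col] = np.nan
        result.loc[matched, col] = intervals[col].to_numpy()[pos[matched]]
    return result

result = using_interval_index(data, intervals)
print(result.head())
```

Rows whose time falls outside every interval keep NaN in the three added columns, as in the example result in the question.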
