Merge pandas DataFrames based on irregular time intervals

Views: 119 Posted: 2018/5/30 13:37:56 Tags: python-2.7 merge pandas time-series group-by


Problem description

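I'm wondering how I can speed up a merge of two DataFrames. One of them has time-stamped data points (the value column).

    import pandas as pd
    import numpy as np

    data = pd.DataFrame({'time': np.sort(np.random.uniform(0, 100, size=50)),
                         'value': np.random.uniform(-1, 1, size=50)})

The other has time interval information (start_time, end_time) and an associated interval_id.

    intervals = pd.DataFrame({'interval_id': np.arange(9),
                              'start_time': np.random.uniform(0, 5, size=9) + np.arange(0, 90, 10),
                              'end_time': np.random.uniform(5, 10, size=9) + np.arange(0, 90, 10)})

I'd like to merge these two DataFrames more efficiently than the for loop below:

    data['interval_id'] = np.nan
    for index, ser in intervals.iterrows():
        # Rows of data whose time falls inside this interval (bounds inclusive)
        in_interval = ((data['time'] >= ser['start_time']) &
                       (data['time'] <= ser['end_time']))
        # .loc avoids chained assignment
        data.loc[in_interval, 'interval_id'] = ser['interval_id']

    # sort_values() replaces DataFrame.sort(), which has been removed from pandas
    result = data.merge(intervals, how='outer').sort_values('time').reset_index(drop=True)

I keep imagining I'll be able to use pandas time series functionality, like a date range or TimeGrouper, but I have yet to figure out anything more pythonic (pandas-y?) than the above.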
Example result:

             time     value  interval_id  start_time   end_time
    0    0.575976  0.022727          NaN         NaN        NaN
    1    4.607545  0.222568            0    3.618715   8.294847
    2    5.179350  0.438052            0    3.618715   8.294847
    3   11.069956  0.641269            1   10.301728  19.870283
    4   12.387854  0.344192            1   10.301728  19.870283
    5   18.889691  0.582946            1   10.301728  19.870283
    6   20.850469 -0.027436          NaN         NaN        NaN
    7   23.199618  0.731316            2   21.488868  28.968338
    8   26.631284  0.570647            2   21.488868  28.968338
    9   26.996397  0.597035            2   21.488868  28.968338
    10  28.601867 -0.131712            2   21.488868  28.968338
    11  28.660986  0.710856            2   21.488868  28.968338
    12  28.875395 -0.355208            2   21.488868  28.968338
    13  28.959320 -0.430759            2   21.488868  28.968338
    14  29.702800 -0.554742          NaN         NaN        NaN

Any suggestions from time series-savvy people out there would be greatly appreciated.

Update, after Jeff's answer: The main problem is that interval_id has no relation to any regular time interval (e.g., intervals are not always approximately 10 seconds). One interval could be 10 seconds, the next could be 2 seconds, and the next could be 100 seconds, so I can't use any regular rounding scheme as Jeff proposed. Unfortunately, my minimal example above does not make that clear.

Solution

You could use np.searchsorted to find, for each value in data['time'], the index at which it would be inserted into intervals['start_time']. Then you could call np.searchsorted again to find the index at which each value would be inserted into intervals['end_time']. Note that np.searchsorted relies on intervals['start_time'] and intervals['end_time'] being in sorted order.

Wherever these two indices are equal (after subtracting 1 from the start index), the value of data['time'] lies between intervals['start_time'] and intervals['end_time'] for that interval. Note that this relies on the intervals being disjoint.

Using searchsorted in this way is about 5 times faster than using the for-loop:

    import pandas as pd
    import numpy as np

    np.random.seed(1)
    data = pd.DataFrame({'time': np.sort(np.random.uniform(0, 100, size=50)),
                         'value': np.random.uniform(-1, 1, size=50)})
    intervals = pd.DataFrame(
        {'interval_id': np.arange(9),
         'start_time': np.random.uniform(0, 5, size=9) + np.arange(0, 90, 10),
         'end_time': np.random.uniform(5, 10, size=9) + np.arange(0, 90, 10)})

    def using_loop():
        data['interval_id'] = np.nan
        for index, ser in intervals.iterrows():
            in_interval = ((data['time'] >= ser['start_time']) &
                           (data['time'] <= ser['end_time']))
            data.loc[in_interval, 'interval_id'] = ser['interval_id']
        # sort_values() replaces DataFrame.sort(), which has been removed from pandas
        result = data.merge(intervals, how='outer').sort_values('time').reset_index(drop=True)
        return result

    def using_searchsorted():
        # Index of the last interval that starts before each time (-1 if none)
        start_idx = np.searchsorted(intervals['start_time'].values, data['time'].values) - 1
        # Index of the first interval that ends at or after each time
        end_idx = np.searchsorted(intervals['end_time'].values, data['time'].values)
        # A time lies inside an interval exactly where the two indices agree
        mask = (start_idx == end_idx)
        result = data.copy()
        result['interval_id'] = result['start_time'] = result['end_time'] = np.nan
        # start_idx doubles as interval_id because interval_id equals the row position here
        result.loc[mask, 'interval_id'] = start_idx[mask]
        # .loc replaces .ix, which has been removed from pandas
        result.loc[mask, 'start_time'] = intervals['start_time'].values[start_idx[mask]]
        result.loc[mask, 'end_time'] = intervals['end_time'].values[end_idx[mask]]
        return result

    In [254]: %timeit using_loop()
    100 loops, best of 3: 7.74 ms per loop

    In [255]: %timeit using_searchsorted()
    1000 loops, best of 3: 1.56 ms per loop

    In [256]: 7.74/1.56
    Out[256]: 4.961538461538462
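To see the searchsorted idea at work on genuinely irregular intervals, here is a minimal sketch rather than a definitive implementation. The helper name assign_intervals, the three made-up intervals (lasting 10, 2 and 100 seconds) and the sample times are all assumptions introduced for illustration. Unlike the answer's code, it passes explicit side arguments so that both interval bounds are inclusive, as in the original loop, and it looks interval_id up by position instead of assuming it equals the row number. The result is cross-checked against pd.IntervalIndex.get_indexer, which is available in recent pandas versions.

```python
import numpy as np
import pandas as pd


def assign_intervals(data, intervals):
    """Hypothetical helper: label each row of `data` with its enclosing interval.

    Assumes `intervals` is sorted by start_time and its intervals are disjoint.
    """
    starts = intervals['start_time'].to_numpy()
    ends = intervals['end_time'].to_numpy()
    times = data['time'].to_numpy()

    # side='right' counts the starts that are <= t, so after subtracting 1
    # start_idx is the last interval that starts at or before t (-1 if none).
    start_idx = np.searchsorted(starts, times, side='right') - 1
    # side='left' counts the ends that are < t, so end_idx is the first
    # interval that ends at or after t (len(ends) if none).
    end_idx = np.searchsorted(ends, times, side='left')
    # t lies inside an interval exactly where both indices agree.
    mask = start_idx == end_idx

    result = data.copy()
    for col in ('interval_id', 'start_time', 'end_time'):
        result[col] = np.nan
        # Look values up by position, so interval_id need not equal the row number.
        result.loc[mask, col] = intervals[col].to_numpy()[start_idx[mask]]
    return result


# Made-up intervals lasting 10, 2 and 100 seconds, with ids that are not row positions.
intervals = pd.DataFrame({'interval_id': [10, 11, 12],
                          'start_time': [0.0, 11.0, 20.0],
                          'end_time': [10.0, 13.0, 120.0]})
data = pd.DataFrame({'time': [-1.0, 0.0, 5.0, 10.0, 10.5, 12.0,
                              13.0, 15.0, 20.0, 119.0, 121.0]})
data['value'] = np.arange(len(data), dtype=float)

labeled = assign_intervals(data, intervals)

# Cross-check with pandas' own interval lookup (-1 means "in no interval").
index = pd.IntervalIndex.from_arrays(intervals['start_time'].to_numpy(),
                                     intervals['end_time'].to_numpy(),
                                     closed='both')
positions = index.get_indexer(data['time'].to_numpy())

print(labeled)
print(positions)
```

Times that fall in the gaps between intervals (10.5 and 15.0 here) or outside all of them keep NaN in the three added columns. Like the answer's version, the sketch gives wrong labels if the intervals overlap or are unsorted, so it may be worth validating those assumptions before relying on it.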


                    