Merge pandas DataFrames based on irregular time intervals
Problem description
I'm wondering how I can speed up a merge of two dataframes. One of the dataframes has time-stamped data points (the value column).
import pandas as pd
import numpy as np
data = pd.DataFrame({'time':np.sort(np.random.uniform(0,100,size=50)),
'value':np.random.uniform(-1,1,size=50)})
The other has time interval information (start_time, end_time, and an associated interval_id).
intervals = pd.DataFrame({'interval_id':np.arange(9),
'start_time':np.random.uniform(0,5,size=9) + np.arange(0,90,10),
'end_time':np.random.uniform(5,10,size=9) + np.arange(0,90,10)})
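I'd like to merge these two dataframes more efficiently than the for loop below:
data['interval_id'] = np.nan
for index, ser in intervals.iterrows():
    # Rows whose time falls inside this interval (bounds inclusive)
    in_interval = ((data['time'] >= ser['start_time']) &
                   (data['time'] <= ser['end_time']))
    # .loc writes into data itself rather than into a temporary copy
    data.loc[in_interval, 'interval_id'] = ser['interval_id']
# Join on the shared interval_id column, then restore time order
result = data.merge(intervals, how='outer').sort_values('time').reset_index(drop=True)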
I keep imagining I'll be able to use pandas time series functionality, like a date range or TimeGrouper, but I have yet to figure out anything more pythonic (pandas-y?) than the above.
Example result:
         time      value  interval_id  start_time   end_time
0    0.575976   0.022727          NaN         NaN        NaN
1    4.607545   0.222568            0    3.618715   8.294847
2    5.179350   0.438052            0    3.618715   8.294847
3   11.069956   0.641269            1   10.301728  19.870283
4   12.387854   0.344192            1   10.301728  19.870283
5   18.889691   0.582946            1   10.301728  19.870283
6   20.850469  -0.027436          NaN         NaN        NaN
7   23.199618   0.731316            2   21.488868  28.968338
8   26.631284   0.570647            2   21.488868  28.968338
9   26.996397   0.597035            2   21.488868  28.968338
10  28.601867  -0.131712            2   21.488868  28.968338
11  28.660986   0.710856            2   21.488868  28.968338
12  28.875395  -0.355208            2   21.488868  28.968338
13  28.959320  -0.430759            2   21.488868  28.968338
14  29.702800  -0.554742          NaN         NaN        NaN
Any suggestions from time series-savvy people out there would be greatly appreciated.
Update, after Jeff's answer:
The main problem is that interval_id has no relation to any regular time interval (e.g., intervals are not always approximately 10 seconds). One interval could be 10 seconds, the next could be 2 seconds, and the next could be 100 seconds, so I can't use any regular rounding scheme as Jeff proposed. Unfortunately, my minimal example above does not make that clear.
You could use np.searchsorted to find, for each value in data['time'], the index where it would fit among the values of intervals['start_time']. Then you could call np.searchsorted again to find the index where each value in data['time'] would fit among the values of intervals['end_time']. Note that using np.searchsorted relies on intervals['start_time'] and intervals['end_time'] being in sorted order.
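To make the two lookups concrete, here is a minimal sketch on hand-picked numbers. The arrays `starts`, `ends`, and `times` are hypothetical and not taken from the question's data. The `side='right'` argument on the first call is an assumption made here so that a time exactly equal to a start counts as inside the interval; with the default `side='left'` such a time would be attributed to the previous slot.

```python
import numpy as np

# Hypothetical interval bounds: sorted and non-overlapping.
starts = np.array([1.0, 10.0, 20.0])
ends = np.array([5.0, 12.0, 30.0])
times = np.array([0.5, 3.0, 7.0, 10.0, 25.0, 31.0])

# Position of the last start <= t, or -1 if t precedes every start.
start_idx = np.searchsorted(starts, times, side='right') - 1
# Position of the first end >= t, or len(ends) if t follows every end.
end_idx = np.searchsorted(ends, times, side='left')

print(start_idx)  # [-1  0  0  1  2  2]
print(end_idx)    # [0 0 1 1 2 3]
```

The two arrays agree at positions 1, 3, and 4, which are the times 3.0, 10.0, and 25.0, the only ones that lie inside an interval.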
Once 1 is subtracted from the start index, the two indices are equal at exactly those locations where data['time'] fits in between intervals['start_time'] and intervals['end_time']. Note that this relies on the intervals being disjoint.
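Since the approach depends on the intervals being sorted and disjoint, it may be worth checking those assumptions before relying on the result. The helper below is a sketch; the name `is_sorted_and_disjoint` is hypothetical and not part of pandas or NumPy.

```python
import numpy as np

def is_sorted_and_disjoint(starts, ends):
    """Return True if the intervals are ordered by start and none overlap."""
    starts = np.asarray(starts, dtype=float)
    ends = np.asarray(ends, dtype=float)
    # Every interval must be well formed: start <= end.
    well_formed = np.all(starts <= ends)
    # Every interval must end strictly before the next one starts.
    # Together with well_formed, this also implies starts are sorted.
    no_overlap = np.all(ends[:-1] < starts[1:])
    return bool(well_formed and no_overlap)

print(is_sorted_and_disjoint([1.0, 10.0, 20.0], [5.0, 12.0, 30.0]))
print(is_sorted_and_disjoint([1.0, 4.0], [5.0, 8.0]))
```

With the question's dataframe, the call would be `is_sorted_and_disjoint(intervals['start_time'], intervals['end_time'])`.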
Using searchsorted in this way is about 5 times faster than using the for-loop:
import pandas as pd
import numpy as np

np.random.seed(1)
data = pd.DataFrame({'time': np.sort(np.random.uniform(0, 100, size=50)),
                     'value': np.random.uniform(-1, 1, size=50)})
intervals = pd.DataFrame(
    {'interval_id': np.arange(9),
     'start_time': np.random.uniform(0, 5, size=9) + np.arange(0, 90, 10),
     'end_time': np.random.uniform(5, 10, size=9) + np.arange(0, 90, 10)})

def using_loop():
    data['interval_id'] = np.nan
    for index, ser in intervals.iterrows():
        # Rows whose time falls inside this interval (bounds inclusive)
        in_interval = ((data['time'] >= ser['start_time']) &
                       (data['time'] <= ser['end_time']))
        data.loc[in_interval, 'interval_id'] = ser['interval_id']
    result = data.merge(intervals, how='outer').sort_values('time').reset_index(drop=True)
    return result

def using_searchsorted():
    times = data['time'].values
    # Position of the last start_time <= time (-1 if before every interval);
    # side='right' keeps a time equal to a start_time inside that interval
    start_idx = np.searchsorted(intervals['start_time'].values, times, side='right') - 1
    # Position of the first end_time >= time (len(intervals) if after every interval)
    end_idx = np.searchsorted(intervals['end_time'].values, times)
    # Equal positions mean the time lies inside that interval
    mask = (start_idx == end_idx)
    result = data.copy()
    result['interval_id'] = result['start_time'] = result['end_time'] = np.nan
    # Look the values up by position, only for the rows inside an interval
    result.loc[mask, 'interval_id'] = intervals['interval_id'].values[start_idx[mask]]
    result.loc[mask, 'start_time'] = intervals['start_time'].values[start_idx[mask]]
    result.loc[mask, 'end_time'] = intervals['end_time'].values[end_idx[mask]]
    return result
In [254]: %timeit using_loop()
100 loops, best of 3: 7.74 ms per loop
In [255]: %timeit using_searchsorted()
1000 loops, best of 3: 1.56 ms per loop
In [256]: 7.74/1.56
Out[256]: 4.961538461538462
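As a possible alternative on newer pandas versions, pd.merge_asof can do the "last start_time at or before this time" lookup directly. This is a different technique from the searchsorted answer above, not a restatement of it. The function name `using_merge_asof` and the sample values are hypothetical, the sketch assumes disjoint intervals just as the searchsorted version does, and it has not been timed against the functions above.

```python
import pandas as pd

def using_merge_asof(data, intervals):
    """Attach to each row the interval containing its time, or NaN if none."""
    interval_cols = ['interval_id', 'start_time', 'end_time']
    # merge_asof needs both frames sorted by their join keys.
    left = data.sort_values('time')
    right = intervals.sort_values('start_time')
    # For each time, pick the last interval whose start_time <= time.
    merged = pd.merge_asof(left, right, left_on='time', right_on='start_time')
    # Keep the match only if the time is also at or before end_time.
    inside = merged['time'] <= merged['end_time']
    for col in interval_cols:
        merged[col] = merged[col].where(inside)
    return merged.reset_index(drop=True)

# Small hand-made example (hypothetical values).
data = pd.DataFrame({'time': [0.5, 3.0, 7.0, 10.0, 25.0, 31.0],
                     'value': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})
intervals = pd.DataFrame({'interval_id': [0, 1, 2],
                          'start_time': [1.0, 10.0, 20.0],
                          'end_time': [5.0, 12.0, 30.0]})
result = using_merge_asof(data, intervals)
print(result)
```

Rows at times 0.5, 7.0, and 31.0 fall outside every interval, so their interval_id, start_time, and end_time come back as NaN.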