Python-快速HDF5时间序列数据查询 [英] Python - Fast HDF5 Time Series Data Queries

查看:423
本文介绍了Python-快速HDF5时间序列数据查询的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要对HDF5数据库中特定时间范围内的时间序列数据进行大量连续查询(数据以秒为单位,并不总是连续的",我只知道开始和结束时间).因此,我想知道是否有比我当前代码更快的解决方案,它受

I need to do a lot of successive queries on time series data in specific time spans from a HDF5 database (the data is stord in seconds, not always "continuous", I only know the start and end time). Therefore, I wonder wether there is a faster solution than my current code, which was inspired by this answer:

import pandas as pd
from pandas import HDFStore

store = HDFStore(pathToStore)
dates = pd.date_range(start=start_date,end=end_date, freq='S')
index = store.select_column('XAU','index')
ts    = store.select('XAU', where=index[index.isin(dates)].index)

任何评论和建议,不胜感激!

Any comments and suggestions are highly appreciated, thx!

推荐答案

让我们对其进行测试!

生成1M行DF:

In [129]: df = pd.DataFrame({'val':np.random.rand(10**6)}, index=pd.date_range('1980-01-01', freq='19S', periods=10**6))

In [130]: df.shape
Out[130]: (1000000, 1)

In [131]: df.head()
Out[131]:
                          val
1980-01-01 00:00:00  0.388980
1980-01-01 00:00:19  0.916917
1980-01-01 00:00:38  0.894360
1980-01-01 00:00:57  0.235797
1980-01-01 00:01:16  0.577791

我们洗牌吧

In [132]: df = df.sample(frac=1)

In [133]: df.head()
Out[133]:
                          val
1980-07-04 12:10:11  0.898648
1980-07-08 20:37:39  0.563325
1980-03-10 00:06:12  0.449458
1980-08-07 02:01:42  0.511847
1980-02-28 21:09:43  0.757327

将生成的DF存储到HDF5文件中(注意:默认情况下,仅索引了索引,因此,如果还要按其他列搜索,请使用data_columns参数):

Storing generated DF into HDF5 file (NOTE: per default only index is indexed, so if you are going to search also by other columns, use data_columns parameter):

In [134]: store = pd.HDFStore('d:/temp/test_time_ser.h5')

In [135]: store.append('XAU', df, format='t')

In [136]: store.close()

In [140]: store = pd.HDFStore('d:/temp/test_time_ser.h5')

让我们测试select(where="<query>")方法:

In [141]: store.select('XAU', where="index >= '1980-04-04' and index<= '1980-05-01'").head()
Out[141]:
                          val
1980-04-13 07:22:05  0.391409
1980-04-25 14:23:07  0.400838
1980-04-10 12:32:08  0.136346
1980-04-09 18:58:35  0.944389
1980-04-13 22:34:05  0.115643

测量效果:

In [142]: %timeit store.select('XAU', where="index >= '1980-04-04' and index<= '1980-05-01'")
1 loop, best of 3: 755 ms per loop

让我们将其与您当前的方法进行比较:

Let's compare it with your current approach:

In [144]: dates = pd.date_range(start='1980-04-04',end='1980-05-01', freq='S')

In [145]: index = store.select_column('XAU','index')

In [146]: store.select('XAU', where=index[index.isin(dates)].index).head()
Out[146]:
                          val
1980-04-13 07:22:05  0.391409
1980-04-25 14:23:07  0.400838
1980-04-10 12:32:08  0.136346
1980-04-09 18:58:35  0.944389
1980-04-13 22:34:05  0.115643

In [147]: %timeit store.select('XAU', where=index[index.isin(dates)].index)
1 loop, best of 3: 8.13 s per loop

更新:让我们做同样的测试,但是这次假设索引(时间序列)已排序:

UPDATE: let's do the same test, but this time assuming that the index (time series) is sorted:

In [156]: df = pd.DataFrame({'val':np.random.rand(10**6)}, index=pd.date_range('1980-01-01', freq='19S', periods=10**6))

In [157]: df.shape
Out[157]: (1000000, 1)

In [164]: store.close()

In [165]: store = pd.HDFStore('d:/temp/test_time_ser2.h5')

In [166]: store.append('XAU', df, format='t')

In [167]: %timeit store.select('XAU', where="index >= '1980-04-04' and index<= '1980-05-01'")
1 loop, best of 3: 253 ms per loop

In [168]: %timeit store.select('XAU', where=index[index.isin(dates)].index)
1 loop, best of 3: 8.13 s per loop

这篇关于Python-快速HDF5时间序列数据查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆