How should I use the h5py library for storing time series data?

Question

I have some time series data that I previously stored as HDF5 files using PyTables. I recently tried storing the same data with the h5py library. However, since all elements of a NumPy array must have the same dtype, I have to convert the dates (which are usually the index) to 'float64' before storing them with h5py. When I use PyTables, the index and its dtype are preserved, which lets me query the time series without pulling it all into memory. I guess that is not possible with h5py. Am I missing something here? And if not, in what situations should I use h5py to store time series data? I ask because clarity on this could help me design a project that is more efficient in both processing and storage.

Below is a simple example in which I have to give up the index information in order to store everything as a single-dtype array:

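import h5py
import numpy as np
import pandas as pd

dt_range = pd.date_range('2016-12-01', '2016-12-10')
data = np.arange(0, 20).reshape(-1, 2)
df = pd.DataFrame(data, index=dt_range, columns=list('ab'), dtype='float')
# Convert the DatetimeIndex to Julian dates (float64) so every column shares one dtype
df.index = df.index.to_julian_date()
df = df.reset_index()
# Write the 10x3 float array; the context manager closes the file afterwards
with h5py.File(r'path\temp.h5', 'w') as h:
    dset = h.create_dataset('temp', data=df.values, shape=(10, 3))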

Recommended answer

When I run @piRSquared's code and look at the resulting file with h5py, I see:

In [4]: import h5py
In [5]: f=h5py.File('temp.h5')

In [8]: list(f.keys())
Out[8]: ['temp']
In [9]: f['temp']
Out[9]: <HDF5 group "/temp" (4 members)>
In [10]: list(f['temp'].keys())
Out[10]: ['axis0', 'axis1', 'block0_items', 'block0_values']

In [11]: f['temp']['axis0'][:]
Out[11]: 
array([b'index', b'a', b'b'], 
      dtype='|S5')
In [12]: f['temp']['axis1'][:]
Out[12]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)
In [13]: f['temp']['block0_items'][:]
Out[13]: 
array([b'index', b'a', b'b'], 
      dtype='|S5')
In [14]: f['temp']['block0_values'][:]
Out[14]: 
array([[  2.45772350e+06,   0.00000000e+00,   1.00000000e+00],
       [  2.45772450e+06,   2.00000000e+00,   3.00000000e+00],
       [  2.45772550e+06,   4.00000000e+00,   5.00000000e+00],
       [  2.45772650e+06,   6.00000000e+00,   7.00000000e+00],
       [  2.45772750e+06,   8.00000000e+00,   9.00000000e+00],
       [  2.45772850e+06,   1.00000000e+01,   1.10000000e+01],
       [  2.45772950e+06,   1.20000000e+01,   1.30000000e+01],
       [  2.45773050e+06,   1.40000000e+01,   1.50000000e+01],
       [  2.45773150e+06,   1.60000000e+01,   1.70000000e+01],
       [  2.45773250e+06,   1.80000000e+01,   1.90000000e+01]])

So it has saved the indexing information in three datasets (axis0, axis1 and block0_items) and the values in a fourth (block0_values), which loads as a 2D NumPy array.
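A similar split between index and values can be built by hand with h5py. The following is only a minimal sketch under stated assumptions, not the layout pandas or PyTables actually writes: the group name temp, the dataset names index and values, and the temporary file path are all made up for illustration. The dates are stored as int64 nanoseconds since the Unix epoch instead of Julian floats, and a date range is read back by slicing the values dataset, so only the matching rows are loaded.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd

dt_range = pd.date_range("2016-12-01", "2016-12-10")
values = np.arange(0, 20, dtype="float64").reshape(-1, 2)

# Dates as int64 nanoseconds since the epoch: lossless, sortable, one dtype.
stamps = dt_range.values.astype("datetime64[ns]").astype("int64")

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "temp.h5")

with h5py.File(path, "w") as f:
    grp = f.create_group("temp")              # assumed layout, not a standard
    grp.create_dataset("index", data=stamps)  # 1-D int64 timestamps
    grp.create_dataset("values", data=values) # 2-D float64 block
    # Column names kept as a fixed-width bytes attribute.
    grp["values"].attrs["columns"] = np.array(["a", "b"], dtype="S")

# Query bounds, expressed in the same int64 nanosecond units.
start = np.datetime64("2016-12-03", "ns").astype("int64")
stop = np.datetime64("2016-12-05", "ns").astype("int64")

with h5py.File(path, "r") as f:
    idx = f["temp/index"][:]  # only the 1-D index is read in full
    lo = int(np.searchsorted(idx, start, side="left"))
    hi = int(np.searchsorted(idx, stop, side="right"))
    block = f["temp/values"][lo:hi]  # reads just the selected rows
    when = idx[lo:hi].astype("datetime64[ns]")

subset = pd.DataFrame(block, index=pd.DatetimeIndex(when), columns=["a", "b"])
print(subset)
```

The index still has to be read to locate the row range, but it is a single 1-D array; the wider values dataset stays on disk apart from the requested slice.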

That's the same kind of information that I'd expect to see in a file created by PyTables.

According to its documentation, pd.HDFStore uses PyTables.
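If querying on the date index without loading the whole file is the goal, the pandas/PyTables route supports it when the frame is written in table format. The sketch below is an illustration only, not @piRSquared's code: it assumes PyTables (the tables package) is installed, keeps the DatetimeIndex intact, and uses a made-up temporary file name.

```python
import os
import tempfile

import numpy as np
import pandas as pd

dt_range = pd.date_range("2016-12-01", "2016-12-10")
data = np.arange(0, 20).reshape(-1, 2)
df = pd.DataFrame(data, index=dt_range, columns=list("ab"), dtype="float")

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "temp_table.h5")

# format="table" writes a queryable PyTables table (the default "fixed"
# format cannot be filtered with `where`).
df.to_hdf(path, key="temp", format="table")

# Only rows whose index falls in the requested range are returned.
subset = pd.read_hdf(
    path,
    key="temp",
    where='index >= "2016-12-03" & index <= "2016-12-05"',
)
print(subset)
```

Table format is generally slower to write than the fixed format, so it is worth choosing only when on-disk selection is actually needed.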
