How to write a Pandas DataFrame into an HDF5 dataset


Question

I'm trying to write data from a Pandas DataFrame into a nested HDF5 file, with multiple groups and several datasets within each group. I'd like to keep it as a single file that will grow on a daily basis. I've had a go with the following code, which shows the structure I'd like to achieve:

import h5py
import numpy as np
import pandas as pd

file = h5py.File('database.h5','w')

d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d) 

groups = ['A', 'B', 'C']

for m in groups:

    # One HDF5 group per top-level name
    group = file.create_group(m)
    dataset = ['1', '2', '3']

    for n in dataset:

        data = df
        # Create an empty dataset with the DataFrame's shape, e.g. /A/A1
        ds = group.create_dataset(m + n, data.shape)
        print("Dataset dataspace is", ds.shape)
        print("Dataset Numpy datatype is", ds.dtype)
        print("Dataset name is", ds.name)
        print("Dataset is a member of the group", ds.parent)
        print("Dataset was created in the file", ds.file)

        print("Writing data...")
        # Only the values are written; index and column labels are not stored
        ds[...] = data

        print("Reading data back...")
        data_read = ds[...]

        print("Printing data...")
        print(data_read)

file.close()

This creates the nested structure, but the index and column labels are lost. I also tried:

df.to_hdf('database.h5', ds, table=True, mode='a')

but it didn't work; I get this error:

AttributeError: 'Dataset' object has no attribute 'split'

Can anyone shed some light on this, please? Many thanks.
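A likely reading of the error: `to_hdf` expects its key to be a string path such as 'A/A1', whereas `ds` in the call above is an h5py Dataset object, which has no `split` method. The sketch below assumes PyTables is installed and a reasonably recent pandas, where `format='table'` is used rather than the older `table=True`. The temporary file path and the key 'A/A1' are illustrative only.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Same shape of data as in the question: 'one' has no value for row 'd'.
df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, np.nan], "two": [1.0, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

# Illustrative location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "database.h5")

# The key is a plain string path inside the file, not an h5py object.
df.to_hdf(path, key="A/A1", format="table", mode="a")

# Reading back returns a DataFrame with its index and column labels.
restored = pd.read_hdf(path, key="A/A1")
print(restored)
```

Note that a file written this way is laid out by pandas and PyTables, so each key holds more than a single raw array.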

Recommended answer

I thought I'd have a go with pandas/PyTables and the HDFStore class instead of h5py, so I tried the following:

import numpy as np
import pandas as pd

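# Open (or create) the store; it stays open so it can be queried afterwards
db = pd.HDFStore('Database.h5')

index = pd.date_range('1/1/2000', periods=8)

df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['Col1', 'Col2', 'Col3'])

groups = ['A', 'B', 'C']
subgroups = ['d', 'e', 'f']

for m in groups:

    for n in subgroups:

        # Keys such as 'A/d' create the nested structure;
        # format='table' makes the data appendable and queryable
        db.put(m + '/' + n, df, format='table', data_columns=True)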

It works: nine groups are created, from A/d to C/f (groups rather than datasets, in PyTables as opposed to h5py?). The columns and index are preserved, and I can do the DataFrame operations I need. I'm still wondering, though, whether this is an efficient way to retrieve data from a specific group that will become huge in the future, i.e. operations like

db['A/d'].Col1[4:]
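One point worth noting: `db['A/d']` reads the whole table into memory before `.Col1[4:]` is applied. For a table that grows daily, a possible alternative is to let `HDFStore.select` read only the rows and columns needed, and to use `HDFStore.append` to add new rows. The following is a minimal sketch, assuming PyTables is installed; the temporary path, the extra day's row, and the cutoff date are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Illustrative location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "Database.h5")

index = pd.date_range("2000-01-01", periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index,
                  columns=["Col1", "Col2", "Col3"])

with pd.HDFStore(path) as db:
    db.put("A/d", df, format="table", data_columns=True)

    # Daily growth: append one more (made-up) day to the existing table.
    new_day = pd.DataFrame(np.random.randn(1, 3),
                           index=pd.date_range("2000-01-09", periods=1),
                           columns=df.columns)
    db.append("A/d", new_day, format="table", data_columns=True)

    # Rough equivalent of db['A/d'].Col1[4:], reading only one column
    # and only the rows from position 4 onwards.
    tail = db.select("A/d", columns=["Col1"], start=4)["Col1"]

    # Filter on the index inside the file instead of in memory.
    recent = db.select("A/d", where="index >= '2000-01-05'",
                       columns=["Col1"])

    # Row count from the table metadata, without loading the data.
    nrows = db.get_storer("A/d").nrows

print(nrows, len(tail), len(recent))
```

Both `where` and `append` depend on the data having been stored with format='table', as in the answer above.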
