Combining hdf5 files


Problem description


I have a number of hdf5 files, each of which has a single dataset. The datasets are too large to hold in RAM. I would like to combine these files into a single file containing all of the datasets separately (i.e., not concatenating them into a single dataset).

One way to do this is to create an hdf5 file and then copy the datasets one by one. This would be slow and complicated because the copy would need to be buffered.
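For concreteness, the buffered copy being described might look something like the sketch below (the file names, the dataset name "data", and the buffer size are all hypothetical):

import h5py as h5

# Chunk-by-chunk copy of one large dataset into the combined file;
# only `step` rows are held in RAM at a time.
with h5.File("part1.h5", "r") as src, h5.File("combined.h5", "a") as dst:
    ds_in = src["data"]
    ds_out = dst.create_dataset("data1", shape=ds_in.shape, dtype=ds_in.dtype)
    step = 1 << 20  # rows per buffered read/write
    for start in range(0, ds_in.shape[0], step):
        ds_out[start:start + step] = ds_in[start:start + step]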

Is there a simpler way to do this? It seems like there should be, since it is essentially just creating a container file.

I am using python/h5py.

Solution

One solution is to use the h5py interface to the low-level H5Ocopy function of the HDF5 API, in particular the h5py.h5o.copy function:

In [1]: import h5py as h5

In [2]: hf1 = h5.File("f1.h5")

In [3]: hf2 = h5.File("f2.h5")

In [4]: hf1.create_dataset("val", data=35)
Out[4]: <HDF5 dataset "val": shape (), type "<i8">

In [5]: hf1.create_group("g1")
Out[5]: <HDF5 group "/g1" (0 members)>

In [6]: hf1.get("g1").create_dataset("val2", data="Thing")
Out[6]: <HDF5 dataset "val2": shape (), type "|O8">

In [7]: hf1.flush()

In [8]: h5.h5o.copy(hf1.id, "g1", hf2.id, "newg1")

In [9]: h5.h5o.copy(hf1.id, "val", hf2.id, "newval")

In [10]: hf2.values()
Out[10]: [<HDF5 group "/newg1" (1 members)>, <HDF5 dataset "newval": shape (), type "<i8">]

In [11]: hf2.get("newval").value
Out[11]: 35

In [12]: hf2.get("newg1").values()
Out[12]: [<HDF5 dataset "val2": shape (), type "|O8">]

In [13]: hf2.get("newg1").get("val2").value
Out[13]: 'Thing'

The above was generated with h5py version 2.0.1-2+b1 and IPython version 0.13.1-2+deb7u1 atop Python version 2.7.3-4+deb7u1 from a more-or-less vanilla install of Debian Wheezy. The files f1.h5 and f2.h5 did not exist prior to executing the above. Note that, per salotz, for Python 3 the dataset/group names need to be bytes (e.g., b"val"), not str.

The hf1.flush() in command [7] is crucial, as the low-level interface apparently always reads from the version of the .h5 file stored on disk, not the one cached in memory. Copying datasets to or from groups not at the root of a File can be achieved by supplying the ID of that group using, e.g., hf1.get("g1").id.
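For example, to copy val2 out of the non-root group g1 in hf1 into the group newg1 in hf2 (a hypothetical continuation of the session above; the b"..." names also keep it valid on Python 3):

h5.h5o.copy(hf1.get("g1").id, b"val2", hf2.get("newg1").id, b"val2_copy")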

Note that h5py.h5o.copy will fail with an exception (no clobber) if an object of the indicated name already exists in the destination location.
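Putting the pieces together for the original problem, a minimal merging script might look like the sketch below. It assumes each source file keeps its dataset(s) at the root; the sources are opened read-only, so no flush() is needed, and destination names are prefixed with the source file's stem to avoid the no-clobber failure just described:

import h5py as h5

src_files = ["f1.h5", "f2.h5", "f3.h5"]  # hypothetical input files
with h5.File("combined.h5", "w") as dst:
    for fname in src_files:
        with h5.File(fname, "r") as src:
            for name in src:  # iterate over root members of the source
                # Prefix with the file stem so h5o.copy never sees a
                # pre-existing name in the destination.
                dst_name = fname.rsplit(".", 1)[0] + "_" + name
                h5.h5o.copy(src.id, name.encode(), dst.id, dst_name.encode())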
