Combining hdf5 files
Problem description
I have a number of hdf5 files, each of which has a single dataset. The datasets are too large to hold in RAM. I would like to combine these files into a single file containing all of the datasets separately (i.e. not to concatenate the datasets into a single dataset).
One way to do this is to create an hdf5 file and then copy the datasets one by one. This would be slow and complicated because it would require buffered copies.
Is there a simpler way to do this? It seems like there should be, since it is essentially just creating a container file.
I am using python/h5py.
One solution is to use the h5py interface to the low-level H5Ocopy function of the HDF5 API, in particular the h5py.h5o.copy function:
In [1]: import h5py as h5
In [2]: hf1 = h5.File("f1.h5")
In [3]: hf2 = h5.File("f2.h5")
In [4]: hf1.create_dataset("val", data=35)
Out[4]: <HDF5 dataset "val": shape (), type "<i8">
In [5]: hf1.create_group("g1")
Out[5]: <HDF5 group "/g1" (0 members)>
In [6]: hf1.get("g1").create_dataset("val2", data="Thing")
Out[6]: <HDF5 dataset "val2": shape (), type "|O8">
In [7]: hf1.flush()
In [8]: h5.h5o.copy(hf1.id, "g1", hf2.id, "newg1")
In [9]: h5.h5o.copy(hf1.id, "val", hf2.id, "newval")
In [10]: hf2.values()
Out[10]: [<HDF5 group "/newg1" (1 members)>, <HDF5 dataset "newval": shape (), type "<i8">]
In [11]: hf2.get("newval").value
Out[11]: 35
In [12]: hf2.get("newg1").values()
Out[12]: [<HDF5 dataset "val2": shape (), type "|O8">]
In [13]: hf2.get("newg1").get("val2").value
Out[13]: 'Thing'
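Scaled up to the original question (many single-dataset files merged into one), the same low-level call can simply be looped over the inputs. Below is a self-contained sketch with hypothetical file names, written for Python 3, where the low-level names must be bytes:

```python
import h5py

# Sketch of the original use case: merge every root-level object from a
# set of input files into one combined file. All file names here are
# hypothetical; on Python 3 the low-level names must be bytes.
inputs = ["part1.h5", "part2.h5"]

# Create sample input files so the sketch is self-contained.
for i, name in enumerate(inputs):
    with h5py.File(name, "w") as f:
        f.create_dataset("data%d" % i, data=list(range(5)))

with h5py.File("combined.h5", "w") as dest:
    for name in inputs:
        with h5py.File(name, "r") as src:
            for obj_name in src:
                # The copy happens inside the HDF5 library, so the
                # dataset contents never pass through Python memory.
                h5py.h5o.copy(src.id, obj_name.encode(), dest.id,
                              obj_name.encode())

with h5py.File("combined.h5", "r") as f:
    print(sorted(f.keys()))  # → ['data0', 'data1']
```

Because the copy is performed inside the HDF5 library itself, this stays within the "don't hold the dataset in RAM" constraint from the question.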
The above was generated with h5py version 2.0.1-2+b1 and IPython version 0.13.1-2+deb7u1 atop Python version 2.7.3-4+deb7u1 from a more-or-less vanilla install of Debian Wheezy. The files f1.h5 and f2.h5 did not exist prior to executing the above. Note that, per salotz, for Python 3 the dataset/group names need to be bytes (e.g., b"val"), not str.
The hf1.flush() in command [7] is crucial, as the low-level interface apparently will always draw from the version of the .h5 file stored on disk, not that cached in memory. Copying datasets to/from groups not at the root of a File can be achieved by supplying the ID of that group using, e.g., hf1.get("g1").id.
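For example, a short sketch of copying into a non-root group by passing that group's ID as the destination location (file and group names are hypothetical):

```python
import h5py

# Copying into a non-root group: pass the group's low-level ID as the
# destination location. File names are hypothetical.
with h5py.File("a.h5", "w") as src:
    src.create_dataset("val", data=1)

with h5py.File("a.h5", "r") as src, h5py.File("b.h5", "w") as dst:
    grp = dst.create_group("inner")
    # Destination is grp.id rather than dst.id, so the copy lands at
    # /inner/val instead of /val.
    h5py.h5o.copy(src.id, b"val", grp.id, b"val")

with h5py.File("b.h5", "r") as f:
    print(list(f["inner"].keys()))  # → ['val']
```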
Note that h5py.h5o.copy will fail with an exception (no clobber) if an object of the indicated name already exists in the destination location.
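One way to work around that no-clobber behaviour is to check whether the name already exists in the destination before copying; a sketch with hypothetical file names:

```python
import h5py

# h5o.copy refuses to overwrite, so guard against existing names in the
# destination before copying. File names are hypothetical.
with h5py.File("c.h5", "w") as f:
    f.create_dataset("val", data=1)

with h5py.File("c.h5", "r") as src, h5py.File("d.h5", "w") as dst:
    for name in ("val", "val"):  # the second attempt would collide
        if name in dst:
            print("skipping existing object:", name)
            continue
        h5py.h5o.copy(src.id, name.encode(), dst.id, name.encode())

with h5py.File("d.h5", "r") as f:
    print(list(f.keys()))  # → ['val']
```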