Read HDF5 file into numpy array

Question

I have the following code to read a hdf5 file as a numpy array:

import h5py
import numpy as np

hf = h5py.File('path/to/file', 'r')
n1 = hf.get('dataset_name')
n2 = np.array(n1)

When I print n2, I get this:

Out[15]:
array([[<HDF5 object reference>, <HDF5 object reference>,
        <HDF5 object reference>, <HDF5 object reference>...

How can I read the HDF5 object reference to view the data stored in it?

Recommended answer

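The easiest thing is to index the HDF5 dataset with an empty tuple, [()], which reads the whole dataset into memory. (Older h5py versions offered a .value attribute for this; it was deprecated and then removed in h5py 3.0.)

>>> hf = h5py.File('/path/to/file', 'r')
>>> data = hf['dataset_name'][()]  # `data` is now an ndarray.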

You can also slice the dataset, which produces an actual ndarray with the requested data:

>>> hf['dataset_name'][:10] # produces ndarray as well

But keep in mind that an h5py dataset behaves like an ndarray in many ways, so you can pass the dataset itself, unchanged, to most NumPy functions. For example, this works just fine: np.mean(hf.get('dataset_name')).
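
As a minimal sketch of the three options above (reading everything, slicing, and passing the dataset straight to NumPy), the following writes a small throwaway file and reads it back. The file name, the dataset contents, and the temporary directory are assumptions made for illustration only.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file location and contents, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.h5")

with h5py.File(path, "w") as f:
    f.create_dataset("dataset_name", data=np.arange(20, dtype="f8"))

with h5py.File(path, "r") as hf:
    dset = hf["dataset_name"]
    whole = dset[()]       # read the entire dataset into an ndarray
    first_ten = dset[:10]  # read only the first ten elements
    mean = np.mean(dset)   # NumPy converts the dataset to an array internally

print(type(whole).__name__, whole.shape, first_ten.shape, mean)
```

Note that the arrays are read while the file is still open; once the file is closed, the dataset object itself can no longer be read, but the ndarrays already obtained remain usable.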

I originally misunderstood the question. The problem isn't loading numerical data; it's that the dataset actually contains HDF5 object references. This is an unusual setup, and it's somewhat awkward to read with h5py. You need to dereference each reference in the dataset. I'll show it for just one of them.

First, let's create a file and a temporary dataset:

>>> f = h5py.File('tmp.h5', 'w')
>>> ds = f.create_dataset('data', data=np.zeros(10,))

Next, create a reference to it and store a few of them in a dataset.

>>> ref_dtype = h5py.special_dtype(ref=h5py.Reference)
>>> ref_ds = f.create_dataset('data_refs', data=(ds.ref, ds.ref), dtype=ref_dtype)

Then you can read one of these back, in a roundabout way, by getting its name and then reading from the actual dataset that is referenced.

>>> name = h5py.h5r.get_name(ref_ds[0], f.id) # 2nd argument is the file identifier
>>> print(name)
b'/data'
>>> out = f[name]
>>> print(out.shape)
(10,)

It's roundabout, but it seems to work. The TL;DR is: get the name of the referenced dataset, and read directly from that.
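
For the full case in the question, where every element of the dataset is a reference, h5py's high-level interface also lets you index a file or group with a reference object, which opens the referenced object directly. The sketch below swaps that in for the name lookup shown above and loops over all references; the file name, the dataset names "a", "b" and "data_refs", and their contents are assumptions made up for this example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "refs_example.h5")

with h5py.File(path, "w") as f:
    a = f.create_dataset("a", data=np.zeros(10))
    b = f.create_dataset("b", data=np.ones(3))
    # Dataset whose elements are object references rather than numbers.
    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    refs = f.create_dataset("data_refs", (2,), dtype=ref_dtype)
    refs[0] = a.ref
    refs[1] = b.ref

with h5py.File(path, "r") as f:
    ref_array = f["data_refs"][()]  # ndarray of h5py Reference objects
    # .flat visits every element, so this also handles 2-D reference arrays.
    names = [f[ref].name for ref in ref_array.flat]
    arrays = [f[ref][()] for ref in ref_array.flat]

print(names)
print([arr.shape for arr in arrays])
```

Each entry of `arrays` is an ordinary ndarray. Because the referenced datasets may have different shapes, they are collected in a list rather than stacked into a single array.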

Note:

The h5py.h5r.dereference function seems pretty unhelpful here, despite the name. It returns the ID of the referenced object. This can be read from directly, but it's very easy to cause a crash in this case (I did it several times in this contrived example here). Getting the name and reading from that is much easier.
