Read HDF5 file into numpy array
Question
I have the following code to read an HDF5 file as a numpy array:
import h5py
import numpy as np

hf = h5py.File('path/to/file', 'r')
n1 = hf.get('dataset_name')
n2 = np.array(n1)
When I print n2, I get this:
Out[15]:
array([[<HDF5 object reference>, <HDF5 object reference>,
<HDF5 object reference>, <HDF5 object reference>...
How can I read the HDF5 object reference to view the data stored in it?
Recommended answer
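The easiest thing is to use the .value attribute of the HDF5 dataset. Note that .value was deprecated and then removed in h5py 3.0; on current versions, index the dataset with an empty tuple, [()], which reads the same data.
>>> hf = h5py.File('/path/to/file', 'r')
>>> data = hf['dataset_name'][()]  # `data` is now an ndarray; on h5py < 3.0, .value gives the same result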
You can also slice the dataset, which produces an actual ndarray with the requested data:
>>> hf['dataset_name'][:10] # produces ndarray as well
But keep in mind that in many ways the h5py dataset acts like an ndarray, so you can pass the dataset itself unchanged to many NumPy functions. For example, this works just fine: np.mean(hf.get('dataset_name')).
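As a rough illustration of the three access patterns above (reading everything, slicing, and handing the dataset straight to NumPy), here is a minimal self-contained sketch. The file name demo.h5 and its contents are placeholders created in a temporary directory purely for the demonstration; they are not taken from the question.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

# Write a small numeric dataset so there is something to read back.
with h5py.File(path, "w") as hf:
    hf.create_dataset("dataset_name", data=np.arange(20, dtype=np.float64))

with h5py.File(path, "r") as hf:
    dset = hf["dataset_name"]
    whole = dset[()]                   # read the entire dataset into an ndarray
    first_ten = dset[:10]              # read only the first ten elements
    mean_value = float(np.mean(dset))  # NumPy converts the dataset to an array itself

print(type(whole).__name__, first_ten.shape, mean_value)
```

Opening the file in a with block closes it automatically. Arrays obtained by indexing are copies in memory and remain usable after the file is closed, whereas the dataset object itself does not.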
I misunderstood the question originally. The problem isn't loading the numerical data; it's that the dataset actually contains HDF5 references. This is a strange setup, and it's kind of awkward to read in h5py. You need to dereference each reference in the dataset. I'll show it for just one of them.
First, let's create a file and a temporary dataset:
>>> import h5py
>>> import numpy as np
>>> f = h5py.File('tmp.h5', 'w')
>>> ds = f.create_dataset('data', data=np.zeros(10))
Next, create a reference to it and store a few of them in a dataset.
>>> ref_dtype = h5py.special_dtype(ref=h5py.Reference)
>>> ref_ds = f.create_dataset('data_refs', data=(ds.ref, ds.ref), dtype=ref_dtype)
Then you can read one of these back, in a circuitous way, by getting its name and then reading from the actual dataset that is referenced.
>>> name = h5py.h5r.get_name(ref_ds[0], f.id) # 2nd argument is the file identifier
>>> print(name)
b'/data'
>>> out = f[name]
>>> print(out.shape)
(10,)
It's roundabout, but it seems to work. The TL;DR is: get the name of the referenced dataset, and read directly from that.
Note:
The h5py.h5r.dereference function seems pretty unhelpful here, despite the name. It returns the ID of the referenced object. This can be read from directly, but it's very easy to cause a crash in this case (I did it several times in this contrived example). Getting the name and reading from that is much easier.
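h5py's high-level API also lets you index a file or group directly with a reference, which returns the referenced object without going through its name. The sketch below applies that to every element of a reference dataset. It assumes the dataset holds object references (not region references) that all point at datasets in the same file. The file refs.h5 and the datasets a and b are hypothetical, created only for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "refs.h5")

with h5py.File(path, "w") as f:
    a = f.create_dataset("a", data=np.zeros(10))
    b = f.create_dataset("b", data=np.ones(5))
    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    # A dataset whose elements are references to other datasets.
    f.create_dataset("data_refs", data=(a.ref, b.ref), dtype=ref_dtype)

with h5py.File(path, "r") as f:
    ref_ds = f["data_refs"]
    # check_dtype(ref=...) returns the reference class for reference dtypes, else None.
    holds_refs = h5py.check_dtype(ref=ref_ds.dtype) is not None
    # f[ref] resolves a reference; [()] then reads the target into an ndarray.
    # .flat walks every element, so this also covers 2-D reference datasets.
    arrays = [f[ref][()] for ref in ref_ds[()].flat]

print(holds_refs, [arr.shape for arr in arrays])
```

The result is a flat list of ndarrays in the same order as the references. If the original layout matters, as in the two-dimensional array from the question, the list can be regrouped using ref_ds.shape.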