Fast and efficient way of serializing and retrieving a large number of numpy arrays from HDF5 file
Problem description
I have a huge list of numpy arrays, specifically 113287, where each array is of shape 36 x 2048. In terms of memory, this amounts to 32 gigabytes.
As of now, I have serialized these arrays as a giant HDF5 file. Now, the problem is that retrieving individual arrays from this hdf5 file takes an excruciatingly long time (north of 10 mins) for each access.
How can I speed this up? This is very important for my implementation since I have to index into this list several thousand times for feeding into Deep Neural Networks.
Here's how I index into the hdf5 file:
In [1]: import h5py
In [2]: hf = h5py.File('train_ids.hdf5', 'r')
In [5]: list(hf.keys())[0]
Out[5]: 'img_feats'
In [6]: group_key = list(hf.keys())[0]
In [7]: hf[group_key]
Out[7]: <HDF5 dataset "img_feats": shape (113287, 36, 2048), type "<f4">
# this is where it takes very very long time
In [8]: list(hf[group_key])[-1].shape
Out[8]: (36, 2048)
Any ideas how I can speed things up? Is there any other way of serializing these arrays for faster access?
Note: I'm using a Python list since I want the order to be preserved (i.e. to retrieve in the same order as I put it when I created the hdf5 file)
Recommended answer
According to Out[7], "img_feats" is a large 3d array, of shape (113287, 36, 2048).
Define ds as the dataset (doesn't load anything):
ds = hf[group_key]
x = ds[0] # should be a (36, 2048) array
arr = ds[:] # should load the whole dataset into memory.
arr = ds[:n] # load a subset, slice
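As a minimal sketch of the fast access pattern (using a small stand-in file generated here, since the original train_ids.hdf5 isn't available), indexing the dataset directly reads only one (36, 2048) slab from disk, with no need to materialize the rest:

```python
import numpy as np
import h5py

# Build a small stand-in file with the same layout (10 rows instead of 113287
# so it runs quickly; the dataset name matches the question).
with h5py.File('demo.hdf5', 'w') as hf:
    hf.create_dataset('img_feats',
                      data=np.random.rand(10, 36, 2048).astype('<f4'))

with h5py.File('demo.hdf5', 'r') as hf:
    ds = hf['img_feats']   # no data loaded yet
    last = ds[-1]          # reads exactly one (36, 2048) slab from disk
    print(last.shape)      # (36, 2048)
```

Contrast this with `list(ds)[-1]` from the question, which forces h5py to read all 113287 rows into memory just to take the last one.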
According to h5py-reading-writing-data :

HDF5 datasets re-use the NumPy slicing syntax to read and write to the file. Slice specifications are translated directly to HDF5 "hyperslab" selections, and are a fast and efficient way of accessing data in the file.
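Since each contiguous slice maps to one hyperslab selection, feeding a network can be done with a simple batch iterator over slices. A hedged sketch (the file name, dataset size, and batch size here are stand-ins, not the questioner's real values):

```python
import numpy as np
import h5py

def iter_batches(path, key, batch_size):
    """Yield contiguous batches; each slice is a single fast hyperslab read."""
    with h5py.File(path, 'r') as hf:
        ds = hf[key]
        for start in range(0, ds.shape[0], batch_size):
            yield ds[start:start + batch_size]

# Small demo file standing in for the real one.
with h5py.File('demo_batches.hdf5', 'w') as hf:
    hf.create_dataset('img_feats', data=np.zeros((10, 36, 2048), dtype='<f4'))

shapes = [b.shape for b in iter_batches('demo_batches.hdf5', 'img_feats', 4)]
print(shapes)  # [(4, 36, 2048), (4, 36, 2048), (2, 36, 2048)]
```

This keeps the on-disk order, which also satisfies the questioner's requirement of retrieving arrays in insertion order.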
I don't see any point in wrapping that in list(); that is, in splitting the 3d array into a list of 113287 2d arrays. There's a clean mapping between 3d datasets on the HDF5 file and numpy arrays.
h5py-fancy-indexing warns that fancy indexing of a dataset is slower. That is, seeking to load, say, the [1, 1000, 3000, 6000] subarrays of that large dataset.
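To illustrate, here is a sketch with a small synthetic dataset: h5py does accept a list of (increasing) indices on one axis, but for many scattered reads it can be cheaper to load the dataset once and fancy-index the in-memory array, assuming it fits in RAM as the 32 GB figure suggests it might not for the full data:

```python
import numpy as np
import h5py

# Tiny synthetic dataset standing in for the large one.
with h5py.File('demo_fancy.hdf5', 'w') as hf:
    hf.create_dataset('img_feats',
                      data=np.arange(20 * 3 * 4, dtype='<f4').reshape(20, 3, 4))

with h5py.File('demo_fancy.hdf5', 'r') as hf:
    ds = hf['img_feats']
    idx = [1, 5, 9, 15]      # h5py requires these to be in increasing order
    picked = ds[idx]         # fancy indexing on the dataset: slower per element
    picked2 = ds[:][idx]     # read everything once, then index in memory
    print(np.array_equal(picked, picked2))  # True
```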
如果使用这么大的数据集过于混乱,您可能想尝试编写和读取一些较小的数据集.
You might want to experiment with writing and reading some smaller datasets if working with this large one is too confusing.
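For such an experiment, one knob worth trying is the chunk layout, since it controls how much HDF5 reads per access. A sketch under assumed parameters (one-row chunks are a guess to match per-array access, not a tuned recommendation):

```python
import numpy as np
import h5py

data = np.random.rand(1000, 36, 64).astype('<f4')  # smaller stand-in dataset

with h5py.File('demo_small.hdf5', 'w') as hf:
    hf.create_dataset('contiguous', data=data)
    # Chunking by single rows means ds[i] reads exactly one chunk from disk.
    hf.create_dataset('chunked', data=data, chunks=(1, 36, 64))

with h5py.File('demo_small.hdf5', 'r') as hf:
    print(hf['chunked'].chunks)  # (1, 36, 64)
    row = hf['chunked'][123]
    print(row.shape)             # (36, 64)
```

Timing single-row reads against both layouts on a small file like this should make the access-pattern trade-offs visible before committing to a 32 GB rewrite.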