How to append data to one specific dataset in an hdf5 file with h5py


Question

I am looking for a way to append data to an existing dataset inside a .h5 file using Python (h5py).

A short intro to my project: I am trying to train a CNN using medical image data. Because of the huge amount of data and the heavy memory usage during the transformation of the data to NumPy arrays, I needed to split the "transformation" into chunks: load and preprocess the first 100 medical images and save the NumPy arrays to an HDF5 file, then load the next 100 images and append them to the existing .h5 file, and so on.
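To make the chunking idea concrete, here is a minimal, self-contained sketch of the driver loop I have in mind (the loader below is a dummy stand-in for the real preprocessing, and NUM_IMAGES/CHUNK_SIZE are illustrative; the create and append steps are exactly what the rest of this post is about):

import numpy as np

NUM_IMAGES = 300    # illustrative total number of images
CHUNK_SIZE = 100    # how many images to transform per pass

def load_and_preprocess_chunk(start, count):
    # Dummy stand-in for the real image loading and preprocessing:
    # returns arrays shaped like one chunk of training data.
    X = np.zeros((count, 512, 512, 9), dtype=np.float32)
    Y = np.zeros((count, 512, 512, 1), dtype=np.float32)
    return X, Y

for start in range(0, NUM_IMAGES, CHUNK_SIZE):
    X_chunk, Y_chunk = load_and_preprocess_chunk(start, CHUNK_SIZE)
    first_pass = (start == 0)
    # first_pass -> create the datasets (file mode 'w', as shown next);
    # otherwise  -> resize and append (as shown in the answer below).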

Now, I tried to store the first 100 transformed NumPy arrays as follows:

import h5py
from LoadIPV import LoadIPV

X_train_data, Y_train_data, X_test_data, Y_test_data = LoadIPV()

with h5py.File('.PreprocessedData.h5', 'w') as hf:
    # maxshape with None in the first axis makes the datasets resizable,
    # so more samples can be appended along axis 0 later.
    hf.create_dataset("X_train", data=X_train_data, maxshape=(None, 512, 512, 9))
    hf.create_dataset("X_test", data=X_test_data, maxshape=(None, 512, 512, 9))
    hf.create_dataset("Y_train", data=Y_train_data, maxshape=(None, 512, 512, 1))
    hf.create_dataset("Y_test", data=Y_test_data, maxshape=(None, 512, 512, 1))

As can be seen, the transformed NumPy arrays are split into four different "groups" that are stored in the four HDF5 datasets [X_train, X_test, Y_train, Y_test]. The LoadIPV() function performs the preprocessing of the medical image data.

My problem is that I would like to store the next 100 NumPy arrays in the same .h5 file, in the existing datasets: that is, I would like to append the next 100 NumPy arrays to, for example, the existing X_train dataset of shape [100, 512, 512, 9], so that X_train ends up with shape [200, 512, 512, 9]. The same should work for the other three datasets X_test, Y_train and Y_test.

Answer

I have found a solution that seems to work!

Have a look at this: Incremental writes to hdf5 with h5py!

In order to append data to a specific dataset, it is necessary to first resize that dataset along the corresponding axis and then write the new data at the end of the "old" array.

Thus, the solution looks like this:

with h5py.File('.PreprocessedData.h5', 'a') as hf:
    # Grow each dataset along axis 0 by the number of new samples,
    # then write the new data into the freshly added slots at the end.
    hf["X_train"].resize((hf["X_train"].shape[0] + X_train_data.shape[0]), axis=0)
    hf["X_train"][-X_train_data.shape[0]:] = X_train_data

    hf["X_test"].resize((hf["X_test"].shape[0] + X_test_data.shape[0]), axis=0)
    hf["X_test"][-X_test_data.shape[0]:] = X_test_data

    hf["Y_train"].resize((hf["Y_train"].shape[0] + Y_train_data.shape[0]), axis=0)
    hf["Y_train"][-Y_train_data.shape[0]:] = Y_train_data

    hf["Y_test"].resize((hf["Y_test"].shape[0] + Y_test_data.shape[0]), axis=0)
    hf["Y_test"][-Y_test_data.shape[0]:] = Y_test_data

However, note that the dataset must be created with a maxshape that is None along the axis you want to extend (maxshape needs one entry per axis, so for the 4D datasets above that means maxshape=(None, 512, 512, 9)), for example

h5f.create_dataset('X_train', data=orig_data, compression="gzip", chunks=True,
                   maxshape=(None,) + orig_data.shape[1:])

Otherwise the dataset cannot be extended.
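Putting both halves together: the following is a minimal, self-contained sketch (not from the original answer; the helper name append_or_create and the file name demo.h5 are mine) of a create-or-append pattern wrapping the two steps above:

import h5py
import numpy as np

def append_or_create(hf, name, data):
    # Hypothetical helper: create the dataset on the first write with an
    # unlimited first axis; afterwards, grow axis 0 and fill the new rows.
    if name not in hf:
        hf.create_dataset(name, data=data, compression="gzip", chunks=True,
                          maxshape=(None,) + data.shape[1:])
    else:
        ds = hf[name]
        ds.resize(ds.shape[0] + data.shape[0], axis=0)
        ds[-data.shape[0]:] = data

# Two dummy chunks standing in for two preprocessing passes:
with h5py.File('demo.h5', 'w') as hf:
    for _ in range(2):
        append_or_create(hf, "X_train", np.zeros((2, 512, 512, 9), dtype=np.float32))
    print(hf["X_train"].shape)   # -> (4, 512, 512, 9)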
