How to append data to one specific dataset in a hdf5 file with h5py

Problem description

I am looking for a way to append data to an existing dataset inside a .h5 file using Python (h5py).

A short intro to my project: I am training a CNN on medical image data. Because of the huge amount of data and the heavy memory usage during the transformation of the data to NumPy arrays, I need to split the transformation into several chunks: load and preprocess the first 100 medical images, save the resulting NumPy arrays to an hdf5 file, then load the next 100 images, append them to the existing .h5 file, and so on.

Now, I tried to store the first 100 transformed NumPy arrays as follows:

import h5py
from LoadIPV import LoadIPV

X_train_data, Y_train_data, X_test_data, Y_test_data = LoadIPV()

# maxshape=None along the first axis makes each dataset resizable later
with h5py.File('./PreprocessedData.h5', 'w') as hf:
    hf.create_dataset("X_train", data=X_train_data, maxshape=(None, 512, 512, 9))
    hf.create_dataset("X_test", data=X_test_data, maxshape=(None, 512, 512, 9))
    hf.create_dataset("Y_train", data=Y_train_data, maxshape=(None, 512, 512, 1))
    hf.create_dataset("Y_test", data=Y_test_data, maxshape=(None, 512, 512, 1))

As can be seen, the transformed NumPy arrays are split into four different groups that are stored in the four hdf5 datasets X_train, X_test, Y_train and Y_test. The LoadIPV() function performs the preprocessing of the medical image data.

My problem is that I would like to store the next 100 NumPy arrays in the same .h5 file, in the existing datasets: that is, I would like to append the next 100 NumPy arrays to, for example, the existing X_train dataset of shape [100, 512, 512, 9], so that X_train ends up with shape [200, 512, 512, 9]. The same should work for the other three datasets X_test, Y_train and Y_test.

Answer

I have found a solution that seems to work!

Have a look at this: Incremental writes to hdf5 with h5py

In order to append data to a specific dataset, it is necessary to first resize that dataset along the corresponding axis and then write the new data at the end of the old array.

Thus, the solution looks like this:

with h5py.File('./PreprocessedData.h5', 'a') as hf:
    # For each dataset: grow axis 0 by the size of the new chunk,
    # then write the new data into the freshly added rows at the end.
    hf["X_train"].resize((hf["X_train"].shape[0] + X_train_data.shape[0]), axis=0)
    hf["X_train"][-X_train_data.shape[0]:] = X_train_data

    hf["X_test"].resize((hf["X_test"].shape[0] + X_test_data.shape[0]), axis=0)
    hf["X_test"][-X_test_data.shape[0]:] = X_test_data

    hf["Y_train"].resize((hf["Y_train"].shape[0] + Y_train_data.shape[0]), axis=0)
    hf["Y_train"][-Y_train_data.shape[0]:] = Y_train_data

    hf["Y_test"].resize((hf["Y_test"].shape[0] + Y_test_data.shape[0]), axis=0)
    hf["Y_test"][-Y_test_data.shape[0]:] = Y_test_data

However, note that the dataset must be created with maxshape set to None along the axis you want to extend (e.g. maxshape=(None,) for a one-dimensional dataset), for example

hf.create_dataset('X_train', data=orig_data, compression="gzip", chunks=True, maxshape=(None,) + orig_data.shape[1:])

otherwise the dataset cannot be extended.
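Since the same resize-and-write pattern repeats for all four datasets, it can be wrapped in a small helper that also creates a resizable dataset on the first chunk. The sketch below is only one way to do it, not part of h5py; append_to_dataset is a hypothetical name, and the chunk arrays are assumed to come from LoadIPV() as above.

import h5py

def append_to_dataset(hf, name, arr):
    # Hypothetical helper: append arr along axis 0, creating a
    # resizable dataset on the first call.
    if name not in hf:
        # First chunk: axis 0 unlimited so the dataset can grow later.
        hf.create_dataset(name, data=arr, compression="gzip",
                          chunks=True, maxshape=(None,) + arr.shape[1:])
    else:
        dset = hf[name]
        dset.resize(dset.shape[0] + arr.shape[0], axis=0)
        dset[-arr.shape[0]:] = arr

with h5py.File('./PreprocessedData.h5', 'a') as hf:
    # One call per dataset and per preprocessed chunk:
    append_to_dataset(hf, "X_train", X_train_data)
    append_to_dataset(hf, "X_test", X_test_data)
    append_to_dataset(hf, "Y_train", Y_train_data)
    append_to_dataset(hf, "Y_test", Y_test_data)

Opening the file in 'a' mode makes this work both for a fresh file and for one that already contains earlier chunks.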
