How to configure maxshape argument for H5 and append to file?

Question

I'm trying to combine an image dataset into an H5 file. So far I have managed to create the file, but when I append to it, it just overwrites what's already there. I've looked at other answers (e.g. "Adding data to existing h5py file along new axis using h5py") and tried variations of them, but to no avail.

for i in range(len(files)):
    if i == 0:
        with h5py.File('input_images.h5', 'w') as f:
            img = np.array(Image.open(files[i]))
            f.create_dataset('/array', data = img, maxshape = (None), chunks = True, dtype = img.dtype)
    else:
        with h5py.File('input_images.h5', 'r+') as f:
            img = np.array(Image.open(files[i]))
            f.require_dataset('/array', data = img, shape = img.shape, dtype = img.dtype)
    print(i)

I've tried setting maxshape to (None, None, None) but that just creates an error: ValueError: "maxshape" must have same rank as dataset shape

There are 1000 images in total, each of shape 2048 by 2048. Can someone show me how to fix my code?

Recommended answer

The maxshape parameter lets you resize a dataset after it has been created. Note that maxshape must have the same number of dimensions as the dataset. You supplied a single value, maxshape=(None), but you need 3 dimensions to hold all the image data: (1000, 2048, 2048). Also, the initial dataset size in your code is taken from the data=img array, so the dataset has shape (2048, 2048). It needs a third dimension that indexes the images.
There are 3 approaches to loading all your image data:
1. Set shape=(nfiles,a1,a2) to size the dataset for all images up front. No need to resize unless you want to add more images later.
2. Initially set shape=(1,a1,a2) (for 1 image), then use .resize() to grow the dataset by one each time you add an image. This method becomes inefficient as the dataset grows.
3. Initially set shape=(N,a1,a2) (for N images), then use .resize() to increase the size by N whenever the dataset is full. (N can be any number. I used 10 in the example below, but you might use 100 or 1000 in a real-world application.)
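The rank rule described above can be checked with a minimal sketch. It assumes a small 64 x 64 array as a stand-in for a real image and uses h5py's in-memory 'core' driver so nothing is written to disk; the file name and dataset names are illustrative only.

```python
import h5py
import numpy as np

a1, a2 = 64, 64                              # small stand-in for 2048 x 2048
img = np.zeros((a1, a2), dtype=np.uint8)     # assumed placeholder image

# (None) is just None, not a tuple; a one-element tuple would be (None,)
assert (None) is None

# In-memory file (core driver, no backing store) so the sketch leaves no file behind
with h5py.File('maxshape_demo.h5', 'w', driver='core', backing_store=False) as f:
    # Rank mismatch: 2-D data combined with a 3-D maxshape is rejected
    try:
        f.create_dataset('bad', data=img, maxshape=(None, None, None), chunks=True)
        rank_error = False
    except ValueError:
        rank_error = True

    # Give the data a leading image axis so shape and maxshape are both 3-D
    ds = f.create_dataset('array', data=img[np.newaxis, ...],
                          maxshape=(None, a1, a2), chunks=True)
    initial_shape = ds.shape                 # one image stored so far

    ds.resize(2, axis=0)                     # make room for a second image
    ds[1] = img + 1
    final_shape = ds.shape
```

Only the first axis is declared unlimited here, which matches the case where every image has the same height and width.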

All 3 methods are shown in the example below for 30 images with a smaller image size. I create random integer data for the images. For your files, replace np.random.randint() with np.array(Image.open(files[i])).

The examples demonstrate the process. Note that Methods 1 and 2 only work when you create the HDF5 file and populate the image data in the same pass (because the dataset index is the same as the image counter). Method 3 shows how to add data incrementally. It uses an attribute that counts the number of images loaded. The counter sets the position at which the new image is added. It is also used to check the current dataset size (and resize as needed).
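To illustrate the counter-attribute idea across separate file opens, here is a minimal sketch. The helper name append_image, its block parameter, and the temporary file path are hypothetical and not part of the original answer; the 8 x 8 arrays stand in for real images.

```python
import os
import tempfile

import h5py
import numpy as np


def append_image(path, img, block=10):
    """Append one 2-D image to '/array' in the file at `path` (hypothetical helper)."""
    with h5py.File(path, 'a') as f:           # 'a' creates or reopens; it never truncates
        if 'array' not in f:
            ds = f.create_dataset('array', shape=(block,) + img.shape,
                                  maxshape=(None,) + img.shape,
                                  chunks=True, dtype=img.dtype)
            ds.attrs['n_images'] = 0          # counter of images written so far
        else:
            ds = f['array']

        n = int(ds.attrs['n_images'])
        if n == ds.shape[0]:                  # allocated rows are full: grow by one block
            ds.resize(ds.shape[0] + block, axis=0)

        ds[n] = img                           # the counter is the next free row
        ds.attrs['n_images'] = n + 1
        return n + 1


path = os.path.join(tempfile.mkdtemp(), 'append_demo.h5')
for k in range(12):                           # every call reopens the file
    count = append_image(path, np.full((8, 8), k, dtype=np.uint8))

with h5py.File(path, 'r') as f:
    stored = int(f['array'].attrs['n_images'])
    allocated = f['array'].shape
    last = f['array'][stored - 1]
```

Because the write position comes from the stored attribute rather than from a loop index, the same call works whether the file was created a moment ago or in an earlier run.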

In production code, you need additional checks that each image's size and shape match the dataset's size and shape.
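One possible form of such a check is sketched below. The function name check_image is hypothetical, and the dtype test via np.can_cast is an added assumption about what "matching" should mean; the original answer only mentions size and shape.

```python
import numpy as np


def check_image(ds_shape, ds_dtype, img):
    """Raise ValueError if `img` cannot be stored as one row of the dataset.

    `ds_shape` and `ds_dtype` are the dataset's .shape and .dtype (hypothetical helper).
    """
    row_shape = tuple(ds_shape[1:])           # everything after the image axis
    if img.shape != row_shape:
        raise ValueError(f'image shape {img.shape} != dataset row shape {row_shape}')
    if not np.can_cast(img.dtype, ds_dtype, casting='safe'):
        raise ValueError(f'image dtype {img.dtype} does not fit dataset dtype {ds_dtype}')


ds_shape, ds_dtype = (1000, 2048, 2048), np.dtype(np.uint8)

# A grayscale image of the right size passes silently
check_image(ds_shape, ds_dtype, np.zeros((2048, 2048), dtype=np.uint8))

# An RGB image has an extra channel axis and is rejected
try:
    check_image(ds_shape, ds_dtype, np.zeros((2048, 2048, 3), dtype=np.uint8))
    rgb_rejected = False
except ValueError:
    rgb_rejected = True
```

With an open h5py dataset ds, the call would be check_image(ds.shape, ds.dtype, img) just before the write.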

import h5py
import numpy as np

nfiles = 30          # number of images
a0 = nfiles          # dataset length along the image axis
a1, a2 = 256, 256    # image height and width

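# Method 1: allocate space for all images up front; no resizing needed
with h5py.File('input_images1.h5', 'w') as f:
    for i in range(nfiles):
        # random stand-in for np.array(Image.open(files[i]))
        img_arr = np.random.randint(0, 254, (a1, a2), int)
        if i == 0:
            # pass dtype explicitly; otherwise h5py defaults to 32-bit float
            img_ds = f.create_dataset('/array', shape=(a0, a1, a2),
                                      maxshape=(None, a1, a2), chunks=True,
                                      dtype=img_arr.dtype)
        img_ds[i, :, :] = img_arr
        print(i)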

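# Method 2: start with room for 1 image and grow by 1 on every iteration
with h5py.File('input_images2.h5', 'w') as f:
    for i in range(nfiles):
        # random stand-in for np.array(Image.open(files[i]))
        img_arr = np.random.randint(0, 254, (a1, a2), int)
        if i == 0:
            # pass dtype explicitly; otherwise h5py defaults to 32-bit float
            img_ds = f.create_dataset('/array', shape=(1, a1, a2),
                                      maxshape=(None, a1, a2), chunks=True,
                                      dtype=img_arr.dtype)
        else:
            img_ds.resize(i + 1, axis=0)
        img_ds[i, :, :] = img_arr
        print(i)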

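# Method 3: allocate in blocks of 10 and track the image count in an attribute.
# Mode 'a' reopens an existing file, so running this again appends 30 more images.
with h5py.File('input_images3.h5', 'a') as f:
    for i in range(nfiles):
        # random stand-in for np.array(Image.open(files[i]))
        img_arr = np.random.randint(0, 254, (a1, a2), int)
        if 'array' not in f:
            # pass dtype explicitly; otherwise h5py defaults to 32-bit float
            img_ds = f.create_dataset('/array', shape=(10, a1, a2),
                                      maxshape=(None, a1, a2), chunks=True,
                                      dtype=img_arr.dtype)
            img_ds.attrs['n_images'] = 0
        else:
            img_ds = f['/array']

        # the counter is both the number of stored images and the next free row
        n_images = int(img_ds.attrs['n_images'])
        if n_images == img_ds.shape[0]:
            print('adding 10 rows to /array')
            img_ds.resize(img_ds.shape[0] + 10, axis=0)

        img_ds[n_images, :, :] = img_arr
        img_ds.attrs['n_images'] = n_images + 1
        print(img_ds.attrs['n_images'])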
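
A possible follow-up to Method 3, not part of the original answer: when the number of images is not a multiple of the block size, the last block is only partly filled. The sketch below assumes the same n_images attribute and shrinks the dataset to the stored count with .resize(); the file name and the 4 x 4 arrays are illustrative, and the in-memory 'core' driver keeps the example off disk.

```python
import h5py
import numpy as np

with h5py.File('trim_demo.h5', 'w', driver='core', backing_store=False) as f:
    # One block of 10 rows, of which only 7 will be filled
    ds = f.create_dataset('array', shape=(10, 4, 4), maxshape=(None, 4, 4),
                          chunks=True, dtype=np.uint8)
    for k in range(7):
        ds[k] = np.full((4, 4), k, dtype=np.uint8)
    ds.attrs['n_images'] = 7

    before = ds.shape

    # Shrink the image axis to the number of images actually stored
    ds.resize(int(ds.attrs['n_images']), axis=0)

    after = ds.shape
    kept = ds[:]                              # remaining data, read back as a NumPy array
```

Trimming is only worth doing once loading is finished, since any later append would have to grow the dataset again.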
