What is the most efficient way to read an hdf5 file containing an image stored as a numpy array?

Question

I'm converting image files to hdf5 files as follows:

import h5py
import io
import os
import cv2
import numpy as np
from PIL import Image

def convertJpgtoH5(input_dir, filename, output_dir):
    filepath = input_dir + '/' + filename
    print('image size: %d bytes'%os.path.getsize(filepath))
    img_f = open(filepath, 'rb')
    binary_data = img_f.read()
    binary_data_np = np.asarray(binary_data)
    new_filepath = output_dir + '/' + filename[:-4] + '.hdf5'
    f = h5py.File(new_filepath, 'w')
    dset = f.create_dataset('image', data = binary_data_np)
    f.close()
    print('hdf5 file size: %d bytes'%os.path.getsize(new_filepath))

pathImg = '/path/to/images'
pathH5 = '/path/to/hdf5/files'
ext = [".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]

for img in os.listdir(pathImg):
        if img.endswith(tuple(ext)):
            convertJpgtoH5(pathImg, img, pathH5)

I later read these hdf5 files as follows:

for hf in os.listdir(pathH5):
    if hf.endswith(".hdf5"):
        hf = h5py.File(f"{pathH5}/{hf}", "r")
        key = list(hf.keys())[0]
        data = np.array(hf[key]) 
        img = Image.open(io.BytesIO(data))
        image = cv2.cvtColor(np.float32(img), cv2.COLOR_BGR2RGB)
        hf.close()

Is there a more efficient way to read the hdf5 files rather than converting to numpy array, opening with Pillow before using with OpenCV?

Recommended answer

Ideally this should be closed as a duplicate because most of what you want to do is explained in the answers I referenced in my comments above. I am including those links here:

There is one difference: my examples load all the image data into 1 HDF5 file, and you are creating 1 HDF5 file for each image. Frankly, I don't think there is much value doing that. You wind up with twice as many files and there's nothing gained. If you are still interested in doing that, here are 2 more answers that might help (and I updated your code at the end):

In the interest of addressing your specific question, I modified your code to use cv2 only (no need for PIL). I resized the images and saved them as 1 dataset in 1 file. If you are using the images for training and testing a CNN model, you need to do this anyway (it needs arrays of a consistent size and shape). Also, I think you can save the data as uint8 (unsigned 8-bit integers, as in the code below); there is no need for floats. See below.

import h5py
import glob
import os
import cv2
import numpy as np

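def convertImagetoH5(imgfilename):
    # Despite its name, this helper only loads and resizes one image;
    # the HDF5 file is written later.
    print('image size: %d bytes'%os.path.getsize(imgfilename))
    # cv2.imread takes an IMREAD_* flag, not a color conversion code
    img = cv2.imread(imgfilename, cv2.IMREAD_COLOR)
    # OpenCV loads pixels in BGR order; convert to RGB explicitly
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img_resize = cv2.resize(img, (IMG_WIDTH, IMG_HEIGHT) )
    return img_resize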


pathImg = '/path/to/images'
pathH5 = '/path/to/hdf5file'
ext_list = [".ppm", ".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]
IMG_WIDTH = 120
IMG_HEIGHT = 120

#get list of all images and number of images
all_images = []
for ext in ext_list:
    all_images.extend(glob.glob(pathImg+"/*"+ext, recursive=True))
n_images = len(all_images)

# cv2.resize returns arrays shaped (height, width, channels)
ds_img_arr = np.zeros((n_images, IMG_HEIGHT, IMG_WIDTH,3),dtype=np.uint8)

for cnt,img in enumerate(all_images):
    img_arr = convertImagetoH5(img)
    ds_img_arr[cnt]=img_arr[:]
    
h5_filepath = pathH5 + '/all_image_data.hdf5'
with h5py.File(h5_filepath, 'w') as h5f:
    dset = h5f.create_dataset('images', data=ds_img_arr)

print('hdf5 file size: %d bytes'%os.path.getsize(h5_filepath))

with h5py.File(h5_filepath, "r") as h5r:
    key = list(h5r.keys())[0]
    print (key, h5r[key].shape, h5r[key].dtype)
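
With all images in one dataset, reading becomes a matter of slicing: an h5py dataset only pulls the requested elements from disk. The following is a minimal sketch of that idea, not part of the original answer. It uses random arrays as a stand-in for real images, writes to a temporary file, and the one-image-per-chunk layout is an assumption chosen for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Synthetic stand-in for the resized images: 10 RGB images of 120x120 pixels.
n_images, height, width = 10, 120, 120
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(n_images, height, width, 3), dtype=np.uint8)

# Temporary location, used here only so the sketch is self-contained.
h5_path = os.path.join(tempfile.mkdtemp(), "all_image_data.hdf5")

# Assumed layout: one chunk per image, so reading image i touches one chunk.
with h5py.File(h5_path, "w") as h5f:
    h5f.create_dataset("images", data=fake_images, chunks=(1, height, width, 3))

with h5py.File(h5_path, "r") as h5r:
    dset = h5r["images"]   # a dataset handle; no pixel data is read yet
    one_image = dset[3]    # reads only image 3 into a NumPy array
    batch = dset[4:8]      # reads images 4 through 7 in one call

print(one_image.shape, one_image.dtype)
print(batch.shape)
```

The arrays returned by slicing are ordinary NumPy arrays, so they can be passed to cv2 functions directly without going through PIL.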

If you really want 1 HDF5 for each image, the code from your question is updated below. Again, only cv2 is used -- no need for PIL. Images are not resized. This is for completeness only (to demonstrate the process). It's not how you should manage your image data.

import h5py
import os
import cv2
import numpy as np

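def convertImagetoH5(input_dir, filename, output_dir):
    filepath = input_dir + '/' + filename
    print('image size: %d bytes'%os.path.getsize(filepath))
    # cv2.imread takes an IMREAD_* flag, not a color conversion code
    img = cv2.imread(filepath, cv2.IMREAD_COLOR)
    # OpenCV loads pixels in BGR order; convert to RGB explicitly
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # splitext handles extensions of any length (.jpeg, .tiff)
    new_filepath = output_dir + '/' + os.path.splitext(filename)[0] + '.hdf5'
    with h5py.File(new_filepath, 'w') as h5f:
        h5f.create_dataset('image', data=img)
    print('hdf5 file size: %d bytes'%os.path.getsize(new_filepath))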

pathImg = '/path/to/images'
pathH5 = '/path/to/hdf5file'
ext = [".ppm", ".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]

# Loop thru image files and create a matching HDF5 file
for img in os.listdir(pathImg):
    if img.endswith(tuple(ext)):
        convertImagetoH5(pathImg, img, pathH5)

# Loop thru HDF5 files and read image dataset (as an array)
for h5name in os.listdir(pathH5):
    if h5name.endswith(".hdf5"):
        with h5py.File(f"{pathH5}/{h5name}", "r") as h5f:
            key = list(h5f.keys())[0]
            image = h5f[key][:]
            print(f'{h5name}: {image.shape}, {image.dtype}')
