How do I process a large dataset of images in python?


Problem description


I have a large dataset of around 10,000 image imported from Google drive, and I wish to turn them into a numpy array so I can train my machine learning model. The problem is that my way is taking too long and is very space-consuming on the RAM.

from PIL import Image
import glob
import numpy as np

train_images = glob.glob('/content/drive/MyDrive/AICW/trainy/train/*.jpg')

x_train = np.array([np.array(Image.open(image)) for image in train_images])
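A rough back-of-the-envelope calculation (using the image count and typical size from the P.S. below) shows why this approach strains RAM even before any processing begins:

```python
# Rough memory estimate for loading every image as raw uint8 pixels
# (assumes ~10,270 images at roughly 450 x 600 x 3, per the P.S. below).
n_images = 10_270
bytes_per_image = 450 * 600 * 3   # height x width x RGB channels, 1 byte each
total_gb = n_images * bytes_per_image / 1024**3
print(f"~{total_gb:.1f} GB of raw pixel data")
```

That is close to the RAM of a standard Colab session, and any upcast to a wider dtype (e.g. float32 for training) multiplies it further.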


These lines of code were still running even after 30 minutes. Even when I managed to get a numpy array, it was a collection of images of different sizes and dimensions (e.g. some are 450 × 600 and others are 500 × 600), which is going to be problematic when I feed them into my model. There must be a way that's more time- and space-efficient, right?


P.S. I'm running all of this on Google Colab. The total number of images is 10,270. Size varies from image to image, but they generally have a size of 450 by 600 by 3.

Answer


Lots of good suggestions in the comments (most importantly, the total size of x_train if you don't resize the images). As noted, if you want to use arrays of different sizes, simply make x_train a list (instead of a np.array). Eventually you will probably need to resize for training and testing. The Pillow docs show image conversion to a NumPy array with .asarray(). Not sure if that matters.
I modified your code slightly to 1) create train_x as an empty array of dtype=object (to hold the image arrays), 2) resize the images, and 3) use .asarray() to convert the images. It reads 26,640 images into an array in a few seconds on a desktop system with 24 GB RAM.
Code below:

import glob
import numpy as np
from PIL import Image

train_images = glob.glob('*/*.jpg', recursive=True)
x_train = np.empty(shape=(len(train_images),), dtype=object)
size = 128, 128

for i, image in enumerate(train_images):
    img = Image.open(image)
    img.thumbnail(size)  # thumbnail() resizes in place and returns None, so don't chain it
    x_train[i] = np.asarray(img)
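If the model ultimately needs one contiguous array of uniform shape (rather than a dtype=object array of varying shapes), one possible sketch, not part of the original answer and with a hypothetical helper name `load_images`, is to convert and resize every image to a fixed size directly into a preallocated uint8 buffer:

```python
import numpy as np
from PIL import Image

def load_images(paths, target=(128, 128)):
    """Load images into one contiguous uint8 array of shape (n, H, W, 3).

    Every image is converted to RGB and resized to `target` (width, height),
    so mixed sizes such as 450x600 and 500x600 all end up uniform.
    Preallocating the buffer avoids building large intermediate lists.
    """
    out = np.empty((len(paths), target[1], target[0], 3), dtype=np.uint8)
    for i, path in enumerate(paths):
        with Image.open(path) as img:
            out[i] = np.asarray(img.convert('RGB').resize(target))
    return out
```

Usage would then be something like `x_train = load_images(glob.glob('/content/drive/MyDrive/AICW/trainy/train/*.jpg'))`, yielding an array ready to feed to a model after normalization.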

