How do I process a large dataset of images in python?


Problem description


I have a large dataset of around 10,000 image imported from Google drive, and I wish to turn them into a numpy array so I can train my machine learning model. The problem is that my way is taking too long and is very space-consuming on the RAM.

from PIL import Image
import glob
import numpy as np

train_images = glob.glob('/content/drive/MyDrive/AICW/trainy/train/*.jpg')

x_train = np.array([np.array(Image.open(image)) for image in train_images])
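A rough back-of-the-envelope calculation (using the image count and typical size from the P.S. below) shows why this approach strains RAM even before any processing begins:

```python
# Rough memory estimate for loading every image as raw uint8 pixels
# (assumes ~10,270 images at roughly 450 x 600 x 3, per the P.S. below).
n_images = 10_270
bytes_per_image = 450 * 600 * 3   # height x width x RGB channels, 1 byte each
total_gb = n_images * bytes_per_image / 1024**3
print(f"~{total_gb:.1f} GB of raw pixel data")
```

That is close to the RAM of a standard Colab session, and any upcast to a wider dtype (e.g. float32 for training) multiplies it further.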


These lines of code were still running even after 30 minutes. Even when I managed to get a numpy array, it was a collection of images of different sizes and dimensions (e.g. some are 450 × 600 and others are 500 × 600), which is going to be problematic when I feed them into my model. There must be a way that's more time- and space-efficient, right?


P.S. I'm running all of this on Google Colab. The total number of images is 10,270. Size varies from image to image, but they generally have a size of 450 by 600 by 3.

Answer


Lots of good suggestions in the comments (most importantly, the total size of x_train if you don't resize the images). As noted, if you want to use arrays of different sizes, simply make x_train a list (instead of a np.array). Eventually you will probably need to resize for training and testing. The Pillow docs show image conversion to a NumPy array with .asarray(). Not sure if that matters.
I modified your code slightly to 1) create train_x as an empty array of dtype=object (to hold the image arrays), 2) resize the images, and 3) use .asarray() to convert the images. It reads 26,640 images into an array in a few seconds on a desktop system with 24 GB RAM.
Code below:

import glob
import numpy as np
from PIL import Image

train_images = glob.glob('*/*.jpg', recursive=True)
x_train = np.empty(shape=(len(train_images),), dtype=object)
size = 128, 128

for i, image in enumerate(train_images):
    img = Image.open(image)
    img.thumbnail(size)  # thumbnail() resizes in place and returns None, so don't chain it
    x_train[i] = np.asarray(img)
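If the model ultimately needs one contiguous array of uniform shape (rather than a dtype=object array of varying shapes), one possible sketch, not part of the original answer and with a hypothetical helper name `load_images`, is to convert and resize every image to a fixed size directly into a preallocated uint8 buffer:

```python
import numpy as np
from PIL import Image

def load_images(paths, target=(128, 128)):
    """Load images into one contiguous uint8 array of shape (n, H, W, 3).

    Every image is converted to RGB and resized to `target` (width, height),
    so mixed sizes such as 450x600 and 500x600 all end up uniform.
    Preallocating the buffer avoids building large intermediate lists.
    """
    out = np.empty((len(paths), target[1], target[0], 3), dtype=np.uint8)
    for i, path in enumerate(paths):
        with Image.open(path) as img:
            out[i] = np.asarray(img.convert('RGB').resize(target))
    return out
```

Usage would then be something like `x_train = load_images(glob.glob('/content/drive/MyDrive/AICW/trainy/train/*.jpg'))`, yielding an array ready to feed to a model after normalization.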

