Reading a large dataset from an HDF5 file into x_train and using it in a Keras model


Problem description

I have a large HDF5 file containing 16000 different 512x512 NumPy arrays. Obviously, reading the whole file into RAM will crash the process (the total size of the file is 40 GB).

I want to load this array into data and then split data into train_x and test_x. The labels are stored locally.

I did this, which only creates a reference to the file without fetching the data:

    import h5py

    h5 = h5py.File('/file.hdf5', 'r')
    data = h5.get('data')

but when I try to split data into train and test:

    x_train= data[0:14000]
    y_train= label[0:16000]
    x_test= data[14000:]
    y_test= label[14000:16000]

I get the error:

MemoryError: Unable to allocate 13.42 GiB for an array with shape (14000, 256, 256) and data type float32

I want to load them in batches and train a Keras model, but the error above obviously prevents me from doing so:

    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(x_train, y_train,
                        validation_data=(x_test, y_test),
                        epochs=32, verbose=1)

How can I get around this issue?

Recommended answer

First, let's describe what you are doing.
This statement returns an h5py dataset object for the dataset named 'data': data = h5.get('data'). It does NOT load the entire dataset into memory (which is good). Note: that statement is more typically written as data = h5['data']. Also, I assume there is a similar call to get an h5py object for the 'label' dataset.

Each of your next 4 statements returns a NumPy array built from the given indices and dataset. NumPy arrays are stored in memory, which is why you get the memory error. When the program executes x_train= data[0:14000], it needs 13.42 GiB to load the array into memory. (Note: the error implies the arrays are 256x256, not 512x512.)
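
As a quick illustration of the point above (a minimal sketch reusing the file path and dataset name from the question), the h5py dataset object itself is lazy, but slicing it materializes a NumPy array in RAM:

    import h5py

    h5 = h5py.File('/file.hdf5', 'r')
    data = h5['data']               # lazy h5py Dataset object, nothing read yet
    print(data.shape, data.dtype)   # metadata only, no bulk I/O

    x_train = data[0:14000]         # slicing copies into a NumPy array -> ~13.4 GiB of RAM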

If you don't have enough RAM to store the array, you will have to "do something" to reduce the memory footprint. Options to consider:

  1. Resize the images from 256x256 (or 512x512) to something smaller and save them in a new h5 file
  2. Modify 'data' to use integers instead of floats and save it in a new h5 file
  3. Write the image data to .npy files and load them in batches
  4. Read fewer images at a time and train in batches (see the sketch after this list)
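
For option 4, a common pattern is to wrap the HDF5 file in a keras.utils.Sequence so that model.fit reads only one batch into memory at a time. The sketch below is not code from the original answer: the class name HDF5Sequence is made up, and it assumes the dataset is named 'data' and that label is a NumPy array of length 16000 already in memory.

    import h5py
    import numpy as np
    from tensorflow import keras

    class HDF5Sequence(keras.utils.Sequence):
        """Yields (x, y) batches read on demand from an HDF5 dataset."""
        def __init__(self, h5_path, dataset_name, labels, indices, batch_size=32):
            self.h5 = h5py.File(h5_path, 'r')   # keep the file open; the data stays on disk
            self.dset = self.h5[dataset_name]
            self.labels = np.asarray(labels)    # labels are small enough to hold in memory
            self.indices = np.asarray(indices)
            self.batch_size = batch_size

        def __len__(self):
            return int(np.ceil(len(self.indices) / self.batch_size))

        def __getitem__(self, i):
            batch_idx = self.indices[i * self.batch_size:(i + 1) * self.batch_size]
            batch_idx = np.sort(batch_idx)      # h5py fancy indexing requires increasing indices
            x = self.dset[batch_idx]            # only this batch is read into RAM
            y = self.labels[batch_idx]
            return x, y

    # Hypothetical usage: model.fit accepts a Sequence directly, so the huge
    # in-memory x_train/x_test arrays are no longer needed.
    train_seq = HDF5Sequence('/file.hdf5', 'data', label, np.arange(0, 14000))
    test_seq = HDF5Sequence('/file.hdf5', 'data', label, np.arange(14000, 16000))
    history = model.fit(train_seq, validation_data=test_seq, epochs=32, verbose=1)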

I wrote an answer to a somewhat related question that describes h5py behavior with training and testing data, and how to randomize input from .npy files. It might be helpful. See this answer: h5py writing: How to efficiently write millions of .npy arrays to a .hdf5 file?

As an aside, you probably want to randomize your selection of testing and training data (and not simply pick the first 14000 images for training and the last 2000 images for testing). Also, check your indices for y_train= label[0:16000]; I think you will get an error from the mismatched x_train and y_train sizes.
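
A minimal sketch of such a randomized split, reusing the hypothetical HDF5Sequence class from the sketch above (again an assumption, not code from the original answer):

    import numpy as np

    rng = np.random.default_rng(seed=42)        # fixed seed for a reproducible split
    all_idx = rng.permutation(16000)            # shuffle the 16000 image indices
    train_idx, test_idx = all_idx[:14000], all_idx[14000:]

    train_seq = HDF5Sequence('/file.hdf5', 'data', label, train_idx)
    test_seq = HDF5Sequence('/file.hdf5', 'data', label, test_idx)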
