How to feed multiple NumPy arrays to a deep learning network in Keras?


Problem description


I have around 13 NumPy arrays stored as files that take around 24 gigabytes on disk. Each file is for a single subject and consists of two arrays: one containing input data (a list of 2D matrices, rows represent sequential time), and the other one containing labels of the data.


My final goal is to feed all the data to a deep learning network I've written in Keras to classify new data. But I don't know how to do it without running out of memory.


I've read about Keras's data generators, but cannot find a way to use them for my situation.


I've also looked up HDF5 and h5py, but I don't know how to add all the data to a single array (a dataset in HDF5) without running out of memory.
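As a side note on the memory concern: if the per-subject arrays are stored as .npy files, NumPy can memory-map them so that only the slices you actually access are read from disk. A minimal sketch (the file name and shapes here are made up for illustration):

```python
import os
import tempfile
import numpy as np

# Pretend this is one subject's large array, saved earlier with np.save.
path = os.path.join(tempfile.mkdtemp(), "subject_01_data.npy")
np.save(path, np.arange(100.0).reshape(20, 5))

# mmap_mode="r" maps the file instead of reading it: no data is loaded yet.
data = np.load(path, mmap_mode="r")

# Only the rows of this slice are actually read from disk.
batch = np.asarray(data[0:4])
```

This does not solve the batching problem by itself, but it shows that the whole 24 GB never needs to sit in RAM at once.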

Answer


What you need to do is implement a generator that feeds the data little by little to your model. Keras does have a TimeseriesGenerator, but I don't think you can use it, as it requires you to first load the whole dataset into memory. Thankfully, Keras has a generator for images (called ImageDataGenerator), which we will base our custom generator on.
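To make the "little by little" idea concrete, a bare-bones Python generator could look like the sketch below. It assumes each subject file is an .npz archive with "data" and "labels" arrays (matching the question's two-arrays-per-subject description); the file layout and names are assumptions, not anything from the Keras API:

```python
import numpy as np

def batch_generator(file_paths, batch_size):
    """Yield (data, labels) batches, holding only one subject file in memory."""
    while True:  # Keras expects a generator that loops forever across epochs
        for path in file_paths:
            with np.load(path) as subject:
                data, labels = subject["data"], subject["labels"]
            for i in range(0, len(data), batch_size):
                yield data[i:i + batch_size], labels[i:i + batch_size]
```

This loses the shuffling and index bookkeeping the Keras classes provide, which is exactly what the DirectoryIterator-based approach below adds back.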


First, a few words on how it works. You have two main classes: the ImageDataGenerator class (which mostly deals with any preprocessing you want to perform on each image) and the DirectoryIterator class, which actually does all the work. The latter is what we will modify to get what we want. What it essentially does is:

  • It inherits from keras.preprocessing.image.Iterator, which implements the methods that initialize and generate an array called index_array containing the indices of the images used in each batch. This array changes in each iteration, while the data it draws from is shuffled in each epoch. We will build our generator upon this, to maintain its functionality.
  • It searches for all images under a directory; the labels are deduced from the directory structure. It stores the path to each image and its label in two class variables called filenames and classes respectively. We will use these same variables to store the locations of the timeseries and their classes.
  • It has a method called _get_batches_of_transformed_samples() that accepts an index_array, loads the images whose indices correspond to those of the array, and returns a batch of these images and one containing their classes.
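The index bookkeeping described in the first bullet can be sketched in a few lines of plain Python. This is a simplified stand-in for what the Iterator base class does internally, not its actual code:

```python
import numpy as np

def index_batches(n, batch_size, shuffle=True, seed=0):
    """Yield one epoch of index_arrays, as the Iterator base class does."""
    rng = np.random.RandomState(seed)
    indices = np.arange(n)
    if shuffle:
        rng.shuffle(indices)  # reshuffled at the start of each epoch
    for start in range(0, n, batch_size):
        # each yielded index_array is later passed to
        # _get_batches_of_transformed_samples() to load just those samples
        yield indices[start:start + batch_size]
```

The key point is that only indices live in memory; the samples themselves are loaded on demand, batch by batch.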

What I suggest you do is the following:

  1. Write a script that structures your timeseries data the way you are supposed to structure images when using the ImageDataGenerator. This involves creating a directory for each class and placing each timeseries separately inside this directory. While this will probably require more storage than your current setup, the data won't be loaded into memory while training the model.
  2. Get acquainted with how the DirectoryIterator works.
  3. Define your own generator class (e.g. MyTimeseriesGenerator). Make sure it inherits from the Iterator class mentioned above.
  4. Modify it so that it searches for the file format you want (e.g. HDF5, npy) rather than image formats (e.g. png, jpeg) as it currently does. This is done in lines 1733-1763. You don't need to make it work on multiple threads like Keras' DirectoryIterator does, as this procedure is done only once.
  5. Change the _get_batches_of_transformed_samples() method so that it reads the file type you want instead of reading images (lines 1774-1788). Remove any other image-related functionality the DirectoryIterator has (transforming the images, standardizing them, saving them, etc.).
  6. Make sure that the array returned by the method above matches what you want your model to accept. I'm guessing it should be along the lines of (batch_size, n_timesteps) or (batch_size, n_timesteps, n_features) for the data, and (batch_size, n_classes) for the labels.
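Putting steps 3-6 together, a skeleton might look roughly like the following. Note this is a simplified, Keras-free sketch so it can stand alone: the real class would inherit from keras.preprocessing.image.Iterator instead of reimplementing the index loop, and the class name, .npy format, and one-hot labels are assumptions made for illustration:

```python
import os
import numpy as np

class MyTimeseriesGenerator:
    """Simplified sketch of a DirectoryIterator-style generator for .npy files."""

    def __init__(self, directory, batch_size=32, shuffle=True, seed=None):
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = np.random.RandomState(seed)
        # Mirror DirectoryIterator: one subdirectory per class, labels
        # deduced from the directory structure.
        self.class_names = sorted(
            d for d in os.listdir(directory)
            if os.path.isdir(os.path.join(directory, d))
        )
        self.filenames, labels = [], []
        for label, name in enumerate(self.class_names):
            class_dir = os.path.join(directory, name)
            for fname in sorted(os.listdir(class_dir)):
                if fname.endswith(".npy"):        # .npy instead of png/jpeg
                    self.filenames.append(os.path.join(class_dir, fname))
                    labels.append(label)
        self.classes = np.array(labels)
        self.n = len(self.filenames)

    def _get_batches_of_transformed_samples(self, index_array):
        # Load only the files of this batch; no image transformations needed.
        # Assumes every timeseries has the same (n_timesteps, n_features) shape.
        batch_x = np.stack([np.load(self.filenames[i]) for i in index_array])
        batch_y = np.zeros((len(index_array), len(self.class_names)))
        batch_y[np.arange(len(index_array)), self.classes[index_array]] = 1.0
        return batch_x, batch_y  # (batch, timesteps, features), (batch, classes)

    def __iter__(self):
        while True:  # loop forever, reshuffling each epoch
            indices = np.arange(self.n)
            if self.shuffle:
                self.rng.shuffle(indices)
            for start in range(0, self.n, self.batch_size):
                yield self._get_batches_of_transformed_samples(
                    indices[start:start + self.batch_size])
```

The batch shapes returned here match the guess in step 6; if your model expects something else, adjust _get_batches_of_transformed_samples() accordingly.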


That's about all! It sounds more difficult than it actually is. Once you get acquainted with the DirectoryIterator class, everything else is trivial.


Intended use (after modifications to the code):

from custom_generator import MyTimeseriesGenerator  # assuming you named your class 
                                                    # MyTimeseriesGenerator and you
                                                    # wrote it in a python file 
                                                    # named custom_generator

train_dir = 'path/to/your/properly/structured/train/directory'
valid_dir = 'path/to/your/properly/structured/validation/directory'

train_gen = MyTimeseriesGenerator(train_dir, batch_size=..., ...)
valid_gen = MyTimeseriesGenerator(valid_dir, batch_size=..., ...)

# instantiate and compile model, define hyper-parameters, callbacks, etc.

model.fit_generator(train_gen, validation_data=valid_gen, epochs=..., ...) 

