Training a Keras model from batches of .npy files using a generator?


Question


I am currently dealing with a big-data issue when training on image data with Keras. I have a directory containing batches stored as .npy files. Each batch contains 512 images and has a corresponding label file, also stored as .npy, so the directory looks like: {image_file_1.npy, label_file_1.npy, ..., image_file_37.npy, label_file_37}. Each image file has shape (512, 199, 199, 3) and each label file has shape (512, 1) (either 1 or 0). If I load all the images into one ndarray, it will take 35+ GB. I have read through the Keras documentation, but I still cannot find how to train using a custom generator. I have read about flow_from_dict and ImageDataGenerator(...).flow(), but they are not ideal in this case, or I do not know how to customize them. Here is what I have done.

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD
from keras.preprocessing.image import ImageDataGenerator

val_gen = ImageDataGenerator(rescale=1./255)
x_test = np.load("../data/val_file.npy")
y_test = np.load("../data/val_label.npy")
val_gen.fit(x_test)

model = Sequential()
...
model.add(Dense(512, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# A single sigmoid output with 0/1 labels calls for binary cross-entropy
model.compile(loss='binary_crossentropy',
              optimizer=sgd,
              metrics=['acc'])

model.fit_generator(generate_batch_from_directory(),  # should give 1 image file and 1 label file
                    validation_data=val_gen.flow(x_test, 
                                                 y_test, 
                                                 batch_size=64),
                    validation_steps=32)


So here generate_batch_from_directory() should take image_file_i.npy and label_file_i.npy each time and optimise the weights until there are no batches left. Each image array in the .npy files has already been processed with augmentation, rotation and scaling. Each .npy file is properly mixed with data from classes 1 and 0 (50/50).
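The behaviour described above can be sketched as a plain Python generator. This is only a minimal sketch, not the approach the answer ends up using: the `data_dir` and `n_files` parameters are hypothetical, and it assumes every label file carries the `.npy` extension and follows the `image_file_i.npy` / `label_file_i.npy` naming from the question.

```python
import os

import numpy as np


def generate_batch_from_directory(data_dir, n_files):
    """Yield (images, labels) from one file pair at a time, forever.

    data_dir and n_files are hypothetical parameters for this sketch.
    """
    while True:  # fit_generator expects a plain generator to loop indefinitely
        for i in range(1, n_files + 1):
            x = np.load(os.path.join(data_dir, "image_file_{}.npy".format(i)))
            y = np.load(os.path.join(data_dir, "label_file_{}.npy".format(i)))
            yield x, y
```

Because the generator never stops on its own, fit_generator also needs steps_per_epoch; with 37 file pairs, steps_per_epoch=37 would make one epoch visit every file once.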


If I append all the batches and create one big array, such as:

X_train = np.concatenate([image_file_1, ..., image_file_37])
y_train = np.concatenate([label_file_1, ..., label_file_37])


It does not fit in memory. Otherwise I could use .flow() to generate image sets to train the model.
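To check how much memory the files would need before trying to load them, one option is to memory-map a file and read only its header information. The helper below is a hypothetical sketch (the name `describe_npy` is not from the original post); it relies on the `mmap_mode` argument of `np.load`, which maps the file instead of reading it into RAM.

```python
import numpy as np


def describe_npy(path):
    """Return (shape, dtype, size in bytes) of a .npy file without loading it."""
    arr = np.load(path, mmap_mode="r")  # memory-mapped: data is read lazily
    return arr.shape, arr.dtype, arr.nbytes
```

Multiplying the reported size by the number of image files gives a rough estimate of what a single concatenated array would occupy.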

Thank you for your advice.

Recommended answer


I was finally able to solve the problem, but I had to go through the source code and documentation of keras.utils.Sequence to build my own generator class. This document helps a lot in understanding how generators work in Keras. You can read more details in my Kaggle notebook:

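import os
import keras
import numpy as np

all_files_loc = "datapsycho/imglake/population/train/image_files/"
all_files = os.listdir(all_files_loc)

# Map each image file to its label file (the directory holds one of each per batch)
image_label_map = {
        "image_file_{}.npy".format(i+1): "label_file_{}.npy".format(i+1)
        for i in range(len(all_files) // 2)}
# Only the image files are handed to the generator
partition = [item for item in all_files if "image_file" in item]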

class DataGenerator(keras.utils.Sequence):
    """Serves one pre-batched .npy file pair per training step."""

    def __init__(self, file_list):
        """Constructor can be expanded
           with batch size, dimensions etc.
        """
        self.file_list = file_list
        self.on_epoch_end()

    def __len__(self):
        """Number of batches per epoch: one per image file."""
        return len(self.file_list)

    def __getitem__(self, index):
        """Return the batch at position `index`."""
        # Generate indexes of the batch (a single file here)
        indexes = self.indexes[index:(index+1)]

        # single file
        file_list_temp = [self.file_list[k] for k in indexes]

        # Set of X_train and y_train
        X, y = self.__data_generation(file_list_temp)

        return X, y

    def on_epoch_end(self):
        """Reset the file indexes after each epoch."""
        self.indexes = np.arange(len(self.file_list))

    def __data_generation(self, file_list_temp):
        """Load the image and label arrays for the given file(s)."""
        data_loc = "datapsycho/imglake/population/train/image_files/"
        # file_list_temp holds a single file, so this loop runs once
        for ID in file_list_temp:
            x_file_path = os.path.join(data_loc, ID)
            y_file_path = os.path.join(data_loc, image_label_map[ID])

            # Store sample
            X = np.load(x_file_path)

            # Store class
            y = np.load(y_file_path)

        return X, y

# ====================
# train set
# ====================
training_generator = DataGenerator(partition)
# ValDataGenerator and val_partition (not shown) work the same way as the training generator
validation_generator = ValDataGenerator(val_partition)

hst = model.fit_generator(generator=training_generator,
                          epochs=200,
                          validation_data=validation_generator,
                          use_multiprocessing=True,
                          max_queue_size=32)
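The constructor docstring mentions that the class can be expanded with a batch size. One possible way to do that, sketched here under assumptions rather than taken from the answer, is to translate each global batch index into a file position plus a row range inside that file. The function name `locate_batch` and the default of 64 rows per batch are hypothetical, and the sketch assumes the batch size divides the 512 samples per file evenly.

```python
def locate_batch(index, samples_per_file=512, batch_size=64):
    """Map a global batch index to (file position, start row, stop row).

    Assumes samples_per_file is an exact multiple of batch_size.
    """
    batches_per_file = samples_per_file // batch_size
    # Which file the batch lives in, and which slice of that file it is
    file_pos, within = divmod(index, batches_per_file)
    start = within * batch_size
    return file_pos, start, start + batch_size
```

With this mapping, __len__ would return len(file_list) * batches_per_file, and __getitem__ could slice rows start:stop from the selected file, for example from an array opened with np.load(..., mmap_mode="r") so that only the requested rows are read.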
