How to make the data generator more efficient?


Problem description

To train a neural network, I modified some code I found on YouTube. It looks as follows:

import os
import random

import cv2
import numpy as np


def data_generator(samples, batch_size, shuffle_data=True, resize=224):
    num_samples = len(samples)
    while True:
        # reshuffle the sample list at the start of every epoch
        random.shuffle(samples)

        for offset in range(0, num_samples, batch_size):
            batch_samples = samples[offset: offset + batch_size]

            X_train = []
            y_train = []

            for batch_sample in batch_samples:
                img_name = batch_sample[0]
                label = batch_sample[1]
                # root_dir (the image directory) is defined elsewhere
                img = cv2.imread(os.path.join(root_dir, img_name))

                # resize/normalise the image and one-hot encode the label
                #img, label = preprocessing(img, label, new_height=224, new_width=224, num_classes=37)
                img = preprocessing(img, new_height=224, new_width=224)
                label = my_onehot_encoded(label)

                X_train.append(img)
                y_train.append(label)

            X_train = np.array(X_train)
            y_train = np.array(y_train)

            yield X_train, y_train

Now I tried to train a neural network using this code. The training sample size is 105,000 (image files which contain 8 characters out of 37 possibilities: A-Z, 0-9 and blank space). I used a relatively small batch size (32, which I think is already too small) to make it more efficient, but it nevertheless took forever to train one quarter of the first epoch (I had 826 steps per epoch, and it took 90 minutes for 199 steps; steps_per_epoch = num_train_samples // batch_size).
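
For reference, here is a minimal sketch of how such a generator is typically wired into Keras training; the names train_samples and model are placeholders (not from the original code), and model is assumed to be an already compiled Keras model:

batch_size = 32
num_train_samples = len(train_samples)              # e.g. 105,000 samples
steps_per_epoch = num_train_samples // batch_size   # batches per epoch

train_gen = data_generator(train_samples, batch_size)

# 'model' is assumed to be a compiled Keras model
model.fit(train_gen, steps_per_epoch=steps_per_epoch, epochs=10)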

The following functions are included in the data generator:

def shuffle_data(data):
    # random.shuffle shuffles the list in place and returns None,
    # so return the shuffled list itself rather than the return value
    random.shuffle(data)
    return data

I don't think we can make this function any more efficient, or exclude it from the generator.

def preprocessing(img, new_height, new_width):
    # resize to a fixed size and scale pixel values to [0, 1]
    img = cv2.resize(img, (new_height, new_width))
    img = img / 255
    return img

For preprocessing/resizing the data I use this code to bring the images to a uniform size of e.g. (224, 224, 3). I think this part of the generator takes the most time, but I don't see a way to exclude it from the generator (my memory would be full if we resized the images outside the batches).

# One-hot encoding of the labels
def my_onehot_encoded(label):
    # universe of possible input characters
    characters = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ '
    # mapping of characters to integers
    char_to_int = dict((c, i) for i, c in enumerate(characters))
    # integer-encode the label
    integer_encoded = [char_to_int[char] for char in label]
    # one-hot encode each integer
    onehot_encoded = list()
    for value in integer_encoded:
        character = [0 for _ in range(len(characters))]
        character[value] = 1
        onehot_encoded.append(character)

    return onehot_encoded

I think this part could offer one way to make it more efficient. I am thinking about excluding this code from the generator and producing the array y_train outside of the generator, so that the generator does not have to one-hot encode the labels every time.
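
A rough sketch of that idea (assuming samples is the same list of (img_name, label) pairs passed to the generator above):

import numpy as np

# Sketch only: one-hot encode every label once, outside the generator.
# 'samples' is assumed to be the list of (img_name, label) pairs used above.
encoded_labels = {img_name: np.array(my_onehot_encoded(label))
                  for img_name, label in samples}

# Inside the batch loop the generator would then only do a dictionary lookup:
#     y_train.append(encoded_labels[img_name])
# instead of calling my_onehot_encoded(label) for every sample in every epoch.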

What do you think? Or should I maybe go for a completely different approach?

Recommended answer

I found your question very intriguing because you give only clues. So here is my investigation.

Using your snippets, I found the GitHub repository and the three-part video tutorial on YouTube that mainly focus on the benefits of using generator functions in Python. The data is based on this Kaggle dataset (I would recommend checking out the different kernels on that problem, to compare the approach you have already tried with other CNN networks and to review the API in use).

You do not need to write a data generator from scratch; it is not hard, but reinventing the wheel is not productive.

  • Keras has the ImageDataGenerator class.
  • Plus here is a more generic example for a custom DataGenerator.
  • TensorFlow offers very neat pipelines with its tf.data.Dataset (a rough sketch follows below).
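
For illustration, a rough tf.data sketch that covers the same steps as the custom generator; file_paths and labels are assumed placeholders (a list of full image paths and a matching array of already one-hot encoded labels):

import tensorflow as tf

def load_and_preprocess(path, label):
    # read, decode, resize and normalise one image inside the tf.data pipeline
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (224, 224)) / 255.0
    return img, label

dataset = (tf.data.Dataset.from_tensor_slices((file_paths, labels))
           .shuffle(buffer_size=1000)
           .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))

# model.fit(dataset, epochs=10)   # 'model' is assumed to be compiled already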

Nevertheless, to solve the Kaggle task, the model only needs to perceive single images, hence the model is a simple deep CNN. But as I understand it, you are combining 8 random characters (classes) into one image to recognize multiple classes at once. For that task you need an R-CNN or YOLO as your model. I only recently discovered YOLO v4 for myself, and it is possible to make it work for a specific task really quickly.

General advice about your design and code.

  • Make sure the library uses the GPU; it saves a lot of time. (Even though I repeated the flowers experiment from the repository quite quickly on the CPU, about 10 minutes, the resulting predictions were no better than random guessing, so full training would take a lot of time on the CPU.)
  • Compare different versions to find the bottleneck: try a dataset of 48 images (one per class), then increase the number of images per class and compare; shrink the image size, change the model structure, and so on (see the sketch after this list).
  • Test brand-new models on small, artificial data to prove the idea, or use an iterative process, starting from projects that can be converted to your task (handwriting recognition?).
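
As a small sketch of the first two points, assuming the data_generator and train_samples from the question (20 batches is an arbitrary sample size for timing):

import time
import tensorflow as tf

# 1) Check whether TensorFlow actually sees a GPU.
print("GPUs visible:", tf.config.list_physical_devices('GPU'))

# 2) Time the generator on its own to see how much of a training step
#    is spent on data loading rather than on the model.
gen = data_generator(train_samples, batch_size=32)
start = time.time()
for _ in range(20):
    next(gen)
print("Average data-only time per batch:", (time.time() - start) / 20, "seconds")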
