How to make the data generator more efficient?


Problem description

To train a neural network, I modified some code I found on YouTube. It looks as follows:

import os
import random

import cv2
import numpy as np


def data_generator(samples, batch_size, shuffle_data=True, resize=224):
    num_samples = len(samples)
    while True:
        # reshuffle the sample list at the start of every epoch
        random.shuffle(samples)

        for offset in range(0, num_samples, batch_size):
            batch_samples = samples[offset: offset + batch_size]

            X_train = []
            y_train = []

            for batch_sample in batch_samples:
                img_name = batch_sample[0]
                label = batch_sample[1]
                # root_dir (the image directory) is defined elsewhere
                img = cv2.imread(os.path.join(root_dir, img_name))

                # resize/normalise the image and one-hot encode the label
                #img, label = preprocessing(img, label, new_height=224, new_width=224, num_classes=37)
                img = preprocessing(img, new_height=224, new_width=224)
                label = my_onehot_encoded(label)

                X_train.append(img)
                y_train.append(label)

            X_train = np.array(X_train)
            y_train = np.array(y_train)

            yield X_train, y_train

Now I tried to train a neural network using this code. The training sample size is 105,000 (image files which contain 8 characters out of 37 possibilities: A-Z, 0-9 and blank space). I used a relatively small batch size (32, which I think is already too small) to make it more efficient, but it nevertheless took forever to train one quarter of the first epoch (I had 826 steps per epoch, and it took 90 minutes for 199 steps; steps_per_epoch = num_train_samples // batch_size).
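
For reference, here is a minimal sketch of how such a generator is typically wired into Keras training; the names train_samples and model are placeholders (not from the original code), and model is assumed to be an already compiled Keras model:

batch_size = 32
num_train_samples = len(train_samples)              # e.g. 105,000 samples
steps_per_epoch = num_train_samples // batch_size   # batches per epoch

train_gen = data_generator(train_samples, batch_size)

# 'model' is assumed to be a compiled Keras model
model.fit(train_gen, steps_per_epoch=steps_per_epoch, epochs=10)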

The following functions are included in the data generator:

def shuffle_data(data):
    # random.shuffle shuffles the list in place and returns None,
    # so return the shuffled list itself rather than the return value
    random.shuffle(data)
    return data

I don't think we can make this function any more efficient, or exclude it from the generator.

def preprocessing(img, new_height, new_width):
    # resize to a fixed size and scale pixel values to [0, 1]
    img = cv2.resize(img, (new_height, new_width))
    img = img / 255
    return img

For preprocessing/resizing the data I use this code to bring the images to a uniform size of e.g. (224, 224, 3). I think this part of the generator takes the most time, but I don't see a way to exclude it from the generator (my memory would be full if we resized the images outside the batches).

# One-hot encoding of the labels
def my_onehot_encoded(label):
    # universe of possible input characters
    characters = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ '
    # mapping of characters to integers
    char_to_int = dict((c, i) for i, c in enumerate(characters))
    # integer-encode the label
    integer_encoded = [char_to_int[char] for char in label]
    # one-hot encode each integer
    onehot_encoded = list()
    for value in integer_encoded:
        character = [0 for _ in range(len(characters))]
        character[value] = 1
        onehot_encoded.append(character)

    return onehot_encoded

I think this part could offer one way to make it more efficient. I am thinking about excluding this code from the generator and producing the array y_train outside of the generator, so that the generator does not have to one-hot encode the labels every time.
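
A rough sketch of that idea (assuming samples is the same list of (img_name, label) pairs passed to the generator above):

import numpy as np

# Sketch only: one-hot encode every label once, outside the generator.
# 'samples' is assumed to be the list of (img_name, label) pairs used above.
encoded_labels = {img_name: np.array(my_onehot_encoded(label))
                  for img_name, label in samples}

# Inside the batch loop the generator would then only do a dictionary lookup:
#     y_train.append(encoded_labels[img_name])
# instead of calling my_onehot_encoded(label) for every sample in every epoch.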

What do you think? Or should I maybe go for a completely different approach?

Recommended answer

I found your question very intriguing because you give only clues. So here is my investigation.

Using your snippets, I found the GitHub repository and the three-part video tutorial on YouTube that mainly focus on the benefits of using generator functions in Python. The data is based on this Kaggle dataset (I would recommend checking out the different kernels on that problem, to compare the approach you have already tried with other CNN networks and to review the API in use).

You do not need to write a data generator from scratch; it is not hard, but reinventing the wheel is not productive.

  • Keras has the ImageDataGenerator class.
  • Plus here is a more generic example for a custom DataGenerator.
  • TensorFlow offers very neat pipelines with its tf.data.Dataset (a rough sketch follows below).
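
For illustration, a rough tf.data sketch that covers the same steps as the custom generator; file_paths and labels are assumed placeholders (a list of full image paths and a matching array of already one-hot encoded labels):

import tensorflow as tf

def load_and_preprocess(path, label):
    # read, decode, resize and normalise one image inside the tf.data pipeline
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (224, 224)) / 255.0
    return img, label

dataset = (tf.data.Dataset.from_tensor_slices((file_paths, labels))
           .shuffle(buffer_size=1000)
           .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))

# model.fit(dataset, epochs=10)   # 'model' is assumed to be compiled already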

Nevertheless, to solve the Kaggle task, the model only needs to perceive single images, hence the model is a simple deep CNN. But as I understand it, you are combining 8 random characters (classes) into one image to recognize multiple classes at once. For that task you need an R-CNN or YOLO as your model. I only recently discovered YOLO v4 for myself, and it is possible to make it work for a specific task really quickly.

General advice about your design and code.

  • Make sure the library uses the GPU; it saves a lot of time. (Even though I repeated the flowers experiment from the repository quite quickly on the CPU, about 10 minutes, the resulting predictions were no better than random guessing, so full training would take a lot of time on the CPU.)
  • Compare different versions to find the bottleneck: try a dataset of 48 images (one per class), then increase the number of images per class and compare; shrink the image size, change the model structure, and so on (see the sketch after this list).
  • Test brand-new models on small, artificial data to prove the idea, or use an iterative process, starting from projects that can be converted to your task (handwriting recognition?).
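
As a small sketch of the first two points, assuming the data_generator and train_samples from the question (20 batches is an arbitrary sample size for timing):

import time
import tensorflow as tf

# 1) Check whether TensorFlow actually sees a GPU.
print("GPUs visible:", tf.config.list_physical_devices('GPU'))

# 2) Time the generator on its own to see how much of a training step
#    is spent on data loading rather than on the model.
gen = data_generator(train_samples, batch_size=32)
start = time.time()
for _ in range(20):
    next(gen)
print("Average data-only time per batch:", (time.time() - start) / 20, "seconds")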
