Attach a queue to a NumPy array in TensorFlow for data fetching instead of files?

Question

I have read the TensorFlow CNN tutorial and am trying to use the same model for my project. The problem is reading the data. I have around 25000 images for training and around 5000 each for testing and validation. The files are in PNG format, and I can read them and convert them into numpy.ndarray.

The CNN example in the tutorial uses a queue to fetch records from the list of files provided. I tried to create my own binary file in that format by reshaping each image into a 1-D array and attaching a label value to the front of it. So my data looks like this:

[[1,12,34,24,53,...,105,234,102],
 [12,112,43,24,52,...,115,244,98],
....
]

Each row of the above array has length 22501, where the first element is the label.

I dumped the array to a file using pickle and then tried to read it back using tf.FixedLengthRecordReader, as demonstrated in the example.

I am doing the same thing as cifar10_input.py to read the binary file and put the records into the record object.

Now when I read from the file, the labels and image values are different from what I wrote. I believe the reason is that pickle also dumps extra information, such as braces and brackets, into the binary file, which changes the fixed-length record size.

The above example takes the filenames and passes them to a queue to fetch the files, and then uses the queue to read a single record from a file.

I want to know whether I can pass the numpy array defined above, instead of the filenames, to some reader so that it fetches records one by one from that array instead of from files.

Recommended answer

Probably the easiest way to make your data work with the CNN example code is to make a modified version of read_cifar10() and use it instead:

  1. Write out a binary file containing the contents of your numpy array.

import numpy as np
images_and_labels_array = np.array([[...], ...],  # [[1,12,34,24,53,...,102],
                                                  #  [12,112,43,24,52,...,98],
                                                  #  ...]
                                   dtype=np.uint8)

images_and_labels_array.tofile("/tmp/images.bin")

This file is similar to the format used in the CIFAR-10 data files. You might want to generate multiple files in order to get read parallelism. Note that ndarray.tofile() writes binary data in row-major order with no other metadata; pickling the array will add Python-specific metadata that TensorFlow's parsing routines do not understand.
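
The difference between the two serializations can be checked directly. The following is a minimal sketch with randomly generated stand-in data (the record count, the temporary path, and the variable names are assumptions for illustration): it writes the array with tofile(), confirms that the file size is exactly one byte per element, and compares that with the size of the pickled bytes.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-in data: 4 records of 1 label byte + 22500 image bytes.
num_records, record_bytes = 4, 22501
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(num_records, record_bytes), dtype=np.uint8)

# Raw dump: row-major bytes only, no header or footer.
path = os.path.join(tempfile.mkdtemp(), "images.bin")
data.tofile(path)
raw_size = os.path.getsize(path)

# Pickle wraps the same bytes in Python-specific metadata.
pickled_size = len(pickle.dumps(data))

# The raw file can be read back by stating the dtype and record length.
restored = np.fromfile(path, dtype=np.uint8).reshape(-1, record_bytes)

print("raw:", raw_size, "pickled:", pickled_size)
```

Because the raw file is an exact multiple of the record length, a fixed-length reader stays aligned on record boundaries; the extra bytes in the pickled output are what shift the labels and pixel values.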

  2. Write a modified version of read_cifar10() that handles your record format.

import tensorflow as tf  # TensorFlow 1.x queue-based input API


def read_my_data(filename_queue):

  class ImageRecord(object):
    pass
  result = ImageRecord()

  # Dimensions of the images in the dataset.
  label_bytes = 1
  # Set the following constants as appropriate.
  result.height = IMAGE_HEIGHT
  result.width = IMAGE_WIDTH
  result.depth = IMAGE_DEPTH
  image_bytes = result.height * result.width * result.depth
  # Every record consists of a label followed by the image, with a
  # fixed number of bytes for each.
  record_bytes = label_bytes + image_bytes

  assert record_bytes == 22501  # Based on your question.

  # Read a record, getting filenames from the filename_queue.  No
  # header or footer in the binary, so we leave header_bytes
  # and footer_bytes at their default of 0.
  reader = tf.FixedLengthRecordReader(record_bytes=record_bytes)
  result.key, value = reader.read(filename_queue)

  # Convert from a string to a vector of uint8 that is record_bytes long.
  record_bytes = tf.decode_raw(value, tf.uint8)

  # The first bytes represent the label, which we convert from uint8->int32.
  result.label = tf.cast(
      tf.slice(record_bytes, [0], [label_bytes]), tf.int32)

  # The remaining bytes after the label represent the image, which we reshape
  # from [depth * height * width] to [depth, height, width].
  depth_major = tf.reshape(tf.slice(record_bytes, [label_bytes], [image_bytes]),
                           [result.depth, result.height, result.width])
  # Convert from [depth, height, width] to [height, width, depth].
  result.uint8image = tf.transpose(depth_major, [1, 2, 0])

  return result
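
Note that the reshape above treats the image bytes as depth-major ([depth, height, width]), as in the CIFAR-10 files. An image decoded from a PNG is typically laid out as [height, width, depth], so for multi-channel images it may need to be transposed before flattening. The sketch below mirrors the same slicing in plain numpy so the layout can be checked without building a graph; the dimensions (75 x 100 x 3) and the helper names pack_record and unpack_record are assumptions for illustration, not values from the question.

```python
import numpy as np

# Assumed dimensions for illustration: 75 * 100 * 3 = 22500 image bytes.
IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_DEPTH = 75, 100, 3
LABEL_BYTES = 1


def pack_record(label, image_hwc):
    """Flatten one [height, width, depth] uint8 image into a depth-major record."""
    depth_major = np.transpose(image_hwc, (2, 0, 1))
    return np.concatenate(([label], depth_major.ravel())).astype(np.uint8)


def unpack_record(record):
    """Mirror of read_my_data(): split off the label, reshape, then transpose."""
    label = int(record[0])
    depth_major = record[LABEL_BYTES:].reshape(
        IMAGE_DEPTH, IMAGE_HEIGHT, IMAGE_WIDTH)
    return label, np.transpose(depth_major, (1, 2, 0))


rng = np.random.default_rng(0)
image = rng.integers(
    0, 256, size=(IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_DEPTH), dtype=np.uint8)
record = pack_record(7, image)
label, decoded = unpack_record(record)
print(len(record), label, np.array_equal(decoded, image))
```

For single-channel images (depth 1) both layouts produce the same bytes, so the transpose only matters when there is more than one channel.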

  3. Modify distorted_inputs() to use your new dataset:

    def distorted_inputs(data_dir, batch_size):
      """[...]"""
      filenames = ["/tmp/images.bin"]  # Or a list of filenames if you
                                       # generated multiple files in step 1.
      for f in filenames:
        if not tf.gfile.Exists(f):
          raise ValueError('Failed to find file: ' + f)
    
      # Create a queue that produces the filenames to read.
      filename_queue = tf.train.string_input_producer(filenames)
    
      # Read examples from files in the filename queue.
      read_input = read_my_data(filename_queue)
      reshaped_image = tf.cast(read_input.uint8image, tf.float32)
    
      # [...] (Maybe modify other parameters in here depending on your problem.)
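
If you generate multiple files in step 1, the filenames list can come from a small sharding helper. This is a rough sketch only: write_shards, the images_%d.bin naming scheme, and the placeholder data are hypothetical. It splits the array along whole rows so that every shard holds complete records.

```python
import os
import tempfile

import numpy as np


def write_shards(records, out_dir, num_shards):
    """Write a 2-D uint8 array to num_shards raw files and return their paths."""
    records = np.ascontiguousarray(records, dtype=np.uint8)
    filenames = []
    # array_split divides along axis 0, so no record is cut in half.
    for i, shard in enumerate(np.array_split(records, num_shards)):
        path = os.path.join(out_dir, "images_%d.bin" % i)
        shard.tofile(path)
        filenames.append(path)
    return filenames


records = np.zeros((10, 22501), dtype=np.uint8)  # placeholder data
filenames = write_shards(records, tempfile.mkdtemp(), num_shards=3)
print(filenames)
```

The returned list can then be passed to tf.train.string_input_producer() in place of the single hard-coded path.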
    

This is intended to be a minimal set of steps, given your starting point. It may be more efficient to do the PNG decoding using TensorFlow ops, but that would be a larger change.
