How to correctly map a Python function and then batch the Dataset in TensorFlow

Problem description

I wish to create a pipeline to feed non-standard files (for example, with the extension *.xxx) to a neural network. Currently my code is structured as follows:

  1) I define a list of paths where the training files can be found

  2) I define an instance of the tf.data.Dataset object containing these paths

  3) I map over the Dataset a Python function that takes each path and returns the associated NumPy array (loaded from a folder on the PC); this array has dimensions [256, 256, 192].

  4) I define an initializable iterator and then use it during network training.

My doubt lies in the size of the batches I provide to the network. I would like to supply the network with batches of size 64. How could I do this? For example, if I use train_data.batch(b_size) with b_size = 1, then when iterated the iterator yields one element of shape [256, 256, 192]; what if I wanted to feed the neural net just 64 slices of this array at a time?

This is an extract of my code:

    with tf.name_scope('data'):
        train_filenames = tf.constant(list_of_files_train)

        train_data = tf.data.Dataset.from_tensor_slices(train_filenames)
        train_data = train_data.map(lambda filename: tf.py_func(
            self._parse_xxx_data, [filename], [tf.float32]))

        # shuffle() and batch() return new datasets; the result must be reassigned
        train_data = train_data.shuffle(buffer_size=len(list_of_files_train))
        train_data = train_data.batch(b_size)

        iterator = tf.data.Iterator.from_structure(train_data.output_types, train_data.output_shapes)

        input_data = iterator.get_next()
        train_init = iterator.make_initializer(train_data)

    [...]

    with tf.Session() as sess:
        sess.run(train_init)
        _ = sess.run([self.train_op])

Thanks in advance.

I posted a solution to my problem below. I am still glad to receive any comments or suggestions on possible improvements. Thanks ;)

Recommended answer

It's been a long time, but I'll post a possible solution for batching a dataset with custom shapes in TensorFlow, in case someone needs it.

The tf.data module offers the method unbatch() to unwrap the content of each dataset element. One can first unbatch and then batch the dataset again in the desired way. Often it is also a good idea to shuffle the unbatched dataset before batching it again (so that each batch contains random slices from random elements):

with tf.name_scope('data'):
    train_filenames = tf.constant(list_of_files_train)

    train_data = tf.data.Dataset.from_tensor_slices(train_filenames)
    train_data = train_data.map(lambda filename: tf.py_func(
        self._parse_xxx_data, [filename], [tf.float32]))

    # un-batch first, then batch the data
    train_data = train_data.apply(tf.data.experimental.unbatch())

    # shuffle() and batch() return new datasets; the result must be reassigned
    train_data = train_data.shuffle(buffer_size=BSIZE)
    train_data = train_data.batch(b_size)

    # [...]
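
To sanity-check the shapes this pipeline produces, below is a minimal self-contained sketch written against the TF 1.x API used above. The loader _parse_fake, the file names, and the shuffle buffer size are hypothetical stand-ins for self._parse_xxx_data and the real path list; the set_shape call restores the static shape that tf.py_func discards, which helps shape inference in the rest of the pipeline:

import numpy as np
import tensorflow as tf  # TF 1.x, matching the code above

# Hypothetical loader standing in for self._parse_xxx_data:
# returns one volume of shape [256, 256, 192] per file.
def _parse_fake(filename):
    return np.zeros((256, 256, 192), dtype=np.float32)

def _load(filename):
    volume = tf.py_func(_parse_fake, [filename], tf.float32)
    volume.set_shape([256, 256, 192])  # tf.py_func loses static shape info
    return volume

files = tf.constant(['a.xxx', 'b.xxx'])
train_data = tf.data.Dataset.from_tensor_slices(files)
train_data = train_data.map(_load)

# After unbatching, each element is a single [256, 192] slice instead of a whole volume.
train_data = train_data.apply(tf.data.experimental.unbatch())
train_data = train_data.shuffle(buffer_size=512).batch(64)

iterator = train_data.make_initializable_iterator()
batch = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    print(sess.run(batch).shape)  # (64, 256, 192)

In TF 2.x the same idea is written with the Dataset method train_data.unbatch() and with tf.py_function in place of the deprecated tf.py_func.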
