Build tensorflow dataset iterator that produces batches with special structure


Problem description


As I mentioned in the title, I need batches with a special structure:

1111
5555
2222

Each digit represents a feature vector. So there are N=4 vectors for each of the classes {1,2,5} (M=3), and the batch size is NxM=12.

To accomplish this task I'm using the TensorFlow Dataset API and tfrecords:

  • build tfrecords with features, one file per class
  • create a Dataset instance for each class, and initialize an iterator for each of them
  • to produce a batch, I sample M random iterators from the list of iterators and draw N feature vectors from each (see the sketch after this list)
  • then I stack the features together
  • ...
  • the batch is ready

My concern is that I have hundreds (and maybe thousands in the future) of classes, and storing an iterator for each class doesn't look good (from a memory and performance perspective).

Is there a better way?

Solution

If you have the list of files ordered by class, you can interleave the datasets:

import tensorflow as tf

N = 4
record_files = ['class1.tfrecord', 'class5.tfrecord', 'class2.tfrecord']
M = len(record_files)

dataset = tf.data.Dataset.from_tensor_slices(record_files)
# Consider tf.contrib.data.parallel_interleave for parallelization
dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=M, block_length=N)
# Consider passing num_parallel_calls or using tf.contrib.data.map_and_batch for performance
dataset = dataset.map(parse_function)
dataset = dataset.batch(N * M)
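With cycle_length=M and block_length=N, interleave pulls N consecutive elements from each of the M per-file datasets in turn, so every run of N * M elements contains exactly N records of each class, and batch(N * M) turns each run into one batch with the structure above.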

EDIT:

If you also need shuffling, you can add it in the interleaving step:

import tensorflow as tf

N = 4
record_files = ['class1.tfrecord', 'class5.tfrecord', 'class2.tfrecord']
M = len(record_files)
SHUFFLE_BUFFER_SIZE = 1000

dataset = tf.data.Dataset.from_tensor_slices(record_files)
dataset = dataset.interleave(
    lambda record_file: tf.data.TFRecordDataset(record_file).shuffle(SHUFFLE_BUFFER_SIZE),
    cycle_length=M, block_length=N)
dataset = dataset.map(parse_function)
dataset = dataset.batch(N * M)

NOTE: Both interleave and batch will produce "partial" outputs if there are no more remaining elements (see the docs), so you would have to take special care if it is important to you that every batch has the same shape and structure. For batching you can use tf.contrib.data.batch_and_drop_remainder, but as far as I know there is no similar alternative for interleaving, so you would either have to make sure that all of your files have the same number of examples or just add repeat to the interleaving transformation.
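As a sketch of that last option (reusing the variables from the snippet above), adding repeat() to each per-file dataset keeps interleave from running out of elements on any one file, at the cost of an infinite pipeline that you then bound by step count rather than by OutOfRangeError:

dataset = dataset.interleave(
    lambda record_file: tf.data.TFRecordDataset(record_file)
                          .shuffle(SHUFFLE_BUFFER_SIZE)
                          .repeat(),
    cycle_length=M, block_length=N)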

EDIT 2:

I got a proof of concept of something like what I think you want:

import tensorflow as tf

NUM_EXAMPLES = 12
NUM_CLASSES = 9
records = [[str(i)] * NUM_EXAMPLES for i in range(NUM_CLASSES)]
M = 3
N = 4

dataset = tf.data.Dataset.from_tensor_slices(records)
dataset = dataset.interleave(tf.data.Dataset.from_tensor_slices,
                             cycle_length=NUM_CLASSES, block_length=N)
dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(NUM_CLASSES * N))
dataset = dataset.flat_map(
    lambda data: tf.data.Dataset.from_tensor_slices(
        tf.split(tf.random_shuffle(
            tf.reshape(data, (NUM_CLASSES, N))), NUM_CLASSES // M)))
dataset = dataset.map(lambda data: tf.reshape(data, (M * N,)))
batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    while True:
        try:
            b = sess.run(batch)
            print(b''.join(b).decode())
        except tf.errors.OutOfRangeError:
            break

Output:

888866663333
555544447777
222200001111
222288887777
666655553333
000044441111
888822225555
666600004444
777733331111

The equivalent with record files would be something like this (assuming records are one-dimensional vectors):

import tensorflow as tf

NUM_CLASSES = 9
record_files = ['class{}.tfrecord'.format(i) for i in range(NUM_CLASSES)]
M = 3
N = 4
SHUFFLE_BUFFER_SIZE = 1000

dataset = tf.data.Dataset.from_tensor_slices(record_files)
dataset = dataset.interleave(
    lambda file_name: tf.data.TFRecordDataset(file_name).shuffle(SHUFFLE_BUFFER_SIZE),
    cycle_length=NUM_CLASSES, block_length=N)
dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(NUM_CLASSES * N))
dataset = dataset.flat_map(
    lambda data: tf.data.Dataset.from_tensor_slices(
        tf.split(tf.random_shuffle(
            tf.reshape(data, (NUM_CLASSES, N, -1))), NUM_CLASSES // M)))
dataset = dataset.map(lambda data: tf.reshape(data, (M * N, -1)))

This works by reading N elements of every class at a time, then shuffling that block of NUM_CLASSES * N elements class-wise and splitting it into NUM_CLASSES // M batches. It assumes that the number of classes is divisible by M and that all the files have the same number of records.
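A tiny standalone check of the shuffle-and-split step (illustrative values only), showing how one interleaved block of NUM_CLASSES * N elements becomes NUM_CLASSES // M batches with N elements per class:

import tensorflow as tf

# 9 classes with 4 consecutive elements each, like one interleaved block.
block = tf.constant([10 * c + i for c in range(9) for i in range(4)])
groups = tf.split(tf.random_shuffle(tf.reshape(block, (9, 4))), 9 // 3)

with tf.Session() as sess:
    for g in sess.run(groups):
        print(g.reshape(-1))  # one batch of M * N = 12 elements, 4 per class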
