Build tensorflow dataset iterator that produces batches with special structure
Problem description
As I mentioned in the title, I need batches with a special structure:
1111
5555
2222
Each digit represents a feature-vector. So there are N=4 vectors for each of the classes {1, 2, 5} (M=3), and the batch size is NxM=12.
To accomplish this task I'm using the Tensorflow Dataset API and tfrecords:
- build a tfrecord with features, 1 file for each class
- create an instance of Dataset for each class, and initialise an iterator for each of them
- to produce a batch, I sample M random iterators from the list of iterators and produce N feature-vectors from each iterator
- then I stack the features together
- ...
- batch ready
My concern is that I have hundreds (and maybe thousands in the future) of classes, and storing an iterator for each class doesn't look good (from a memory and performance perspective).
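The iterator-per-class scheme described in the steps above can be sketched in plain Python (a hypothetical simplification using in-memory streams, not the actual tfrecord pipeline):

```python
import itertools
import random

# Hypothetical sketch of the per-class-iterator scheme described above:
# one endless iterator per class; each batch samples M classes and
# takes N feature-vectors from each. Memory grows with the number of
# classes, since every class keeps a live iterator.
def make_batch(class_iters, M, N):
    classes = random.sample(sorted(class_iters), M)
    return [next(class_iters[c]) for c in classes for _ in range(N)]

class_iters = {c: itertools.cycle([str(c)]) for c in [1, 2, 5]}
batch = make_batch(class_iters, M=3, N=4)
print(''.join(batch))  # 4 vectors from each of the 3 sampled classes
```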
Is there a better way?
If you have the list of files ordered by class, you can interleave the datasets:
import tensorflow as tf
N = 4
record_files = ['class1.tfrecord', 'class5.tfrecord', 'class2.tfrecord']
M = len(record_files)
dataset = tf.data.Dataset.from_tensor_slices(record_files)
# Consider tf.contrib.data.parallel_interleave for parallelization
dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=M, block_length=N)
# Consider passing num_parallel_calls or using tf.contrib.data.map_and_batch for performance
dataset = dataset.map(parse_function)
dataset = dataset.batch(N * M)
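The round-robin pattern that interleave produces with these parameters can be modelled in plain Python (a simplified sketch of the cycle_length/block_length semantics, not TensorFlow's implementation; it does not refill exhausted slots from sources beyond cycle_length):

```python
# Simplified model of Dataset.interleave(cycle_length=M, block_length=N):
# cycle over M sources, taking N consecutive elements from each, which
# yields exactly the 1111 5555 2222 batch layout described above.
def interleave(sources, cycle_length, block_length):
    iters = [iter(s) for s in sources[:cycle_length]]
    out = []
    while iters:
        for it in list(iters):
            for _ in range(block_length):
                try:
                    out.append(next(it))
                except StopIteration:
                    iters.remove(it)  # drop exhausted sources
                    break
    return out

records = [['1'] * 8, ['5'] * 8, ['2'] * 8]  # toy per-class streams
print(''.join(interleave(records, cycle_length=3, block_length=4)))
# -> 111155552222111155552222
```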
EDIT:
If you need also shuffling you can add it in the interleaving step:
import tensorflow as tf
N = 4
record_files = ['class1.tfrecord', 'class5.tfrecord', 'class2.tfrecord']
M = len(record_files)
SHUFFLE_BUFFER_SIZE = 1000
dataset = tf.data.Dataset.from_tensor_slices(record_files)
dataset = dataset.interleave(
    lambda record_file: tf.data.TFRecordDataset(record_file).shuffle(SHUFFLE_BUFFER_SIZE),
    cycle_length=M, block_length=N)
dataset = dataset.map(parse_function)
dataset = dataset.batch(N * M)
NOTE: Both interleave and batch will produce "partial" outputs if there are no more remaining elements (see the docs). So you would have to take special care if it is important for you that every batch has the same shape and structure. As for batching, you can use tf.contrib.data.batch_and_drop_remainder, but as far as I know there is no similar alternative for interleaving, so you would either have to make sure that all of your files have the same number of examples or just add repeat to the interleaving transformation.
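The partial-batch issue can be illustrated with a small pure-Python model (a hypothetical helper, not a TensorFlow API):

```python
# Sketch of why partial batches break a fixed N*M structure:
# batching 14 elements with batch_size 12 leaves a trailing batch of 2,
# which drop_remainder-style behaviour discards.
def batch(elements, batch_size, drop_remainder=False):
    batches = [elements[i:i + batch_size]
               for i in range(0, len(elements), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the incomplete final batch
    return batches

data = list(range(14))
print([len(b) for b in batch(data, 12)])                       # [12, 2]
print([len(b) for b in batch(data, 12, drop_remainder=True)])  # [12]
```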
EDIT 2:
I got a proof of concept of something like what I think you want:
import tensorflow as tf
NUM_EXAMPLES = 12
NUM_CLASSES = 9
records = [[str(i)] * NUM_EXAMPLES for i in range(NUM_CLASSES)]
M = 3
N = 4
dataset = tf.data.Dataset.from_tensor_slices(records)
dataset = dataset.interleave(tf.data.Dataset.from_tensor_slices,
                             cycle_length=NUM_CLASSES, block_length=N)
dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(NUM_CLASSES * N))
dataset = dataset.flat_map(
    lambda data: tf.data.Dataset.from_tensor_slices(
        tf.split(tf.random_shuffle(
            tf.reshape(data, (NUM_CLASSES, N))), NUM_CLASSES // M)))
dataset = dataset.map(lambda data: tf.reshape(data, (M * N,)))
batch = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    while True:
        try:
            b = sess.run(batch)
            print(b''.join(b).decode())
        except tf.errors.OutOfRangeError:
            break
Output:
888866663333
555544447777
222200001111
222288887777
666655553333
000044441111
888822225555
666600004444
777733331111
The equivalent with record files would be something like this (assuming records are one-dimensional vectors):
import tensorflow as tf
NUM_CLASSES = 9
record_files = ['class{}.tfrecord'.format(i) for i in range(NUM_CLASSES)]
M = 3
N = 4
SHUFFLE_BUFFER_SIZE = 1000
dataset = tf.data.Dataset.from_tensor_slices(record_files)
dataset = dataset.interleave(
    lambda file_name: tf.data.TFRecordDataset(file_name).shuffle(SHUFFLE_BUFFER_SIZE),
    cycle_length=NUM_CLASSES, block_length=N)
dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(NUM_CLASSES * N))
dataset = dataset.flat_map(
    lambda data: tf.data.Dataset.from_tensor_slices(
        tf.split(tf.random_shuffle(
            tf.reshape(data, (NUM_CLASSES, N, -1))), NUM_CLASSES // M)))
dataset = dataset.map(lambda data: tf.reshape(data, (M * N, -1)))
This works by reading N elements of every class each time, then shuffling and splitting the resulting block. It assumes that the number of classes is divisible by M and that all the files have the same number of records.
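The reshape/shuffle/split step at the heart of this pipeline can be reproduced in NumPy (a hypothetical stand-in for the tf.reshape / tf.random_shuffle / tf.split chain, using integer class ids instead of records):

```python
import numpy as np

# One interleaved block of NUM_CLASSES * N elements (N consecutive per
# class) is reshaped to (NUM_CLASSES, N), the class rows are shuffled,
# and the result is split into NUM_CLASSES // M groups of M classes,
# each flattened into one batch of M * N elements.
NUM_CLASSES, M, N = 9, 3, 4
block = np.repeat(np.arange(NUM_CLASSES), N)  # N elements per class
rows = block.reshape(NUM_CLASSES, N)          # one row per class
rng = np.random.default_rng(0)
rng.shuffle(rows)                             # shuffle class order
batches = [chunk.reshape(M * N)
           for chunk in np.split(rows, NUM_CLASSES // M)]
for b in batches:
    print(b)  # each batch: M classes, N consecutive examples per class
```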