How to use parallel_interleave in TensorFlow


Question

I am reading the code in the TensorFlow benchmarks repo. The following piece of code is the part that creates a TensorFlow dataset from TFRecord files:

ds = tf.data.TFRecordDataset.list_files(tfrecord_file_names)
ds = ds.apply(interleave_ops.parallel_interleave(tf.data.TFRecordDataset, cycle_length=10))

I am trying to change this code to create a dataset directly from JPEG image files:

ds = tf.data.Dataset.from_tensor_slices(jpeg_file_names)
ds = ds.apply(interleave_ops.parallel_interleave(?, cycle_length=10))

I don't know what to write in the ? place. For TFRecord files, the map_func in parallel_interleave() is the constructor (__init__()) of the tf.data.TFRecordDataset class, but I don't know what to use for JPEG files.

We don't need to do any transformations here, because we will zip two datasets together and do the transformations afterwards. The code is as follows:

counter = tf.data.Dataset.range(batch_size)
ds = tf.data.Dataset.zip((ds, counter))
ds = ds.apply(
    batching.map_and_batch(
        map_func=preprocess_fn,
        batch_size=batch_size,
        num_parallel_batches=num_splits))

Because we don't need a transformation in the ? place, I tried to use an empty map_func, but I get the error "map_func must return a Dataset object". I also tried to use tf.data.Dataset, but the output says Dataset is an abstract class that is not allowed there.
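For reference, that error means the function passed to parallel_interleave has to wrap each incoming element in a Dataset. A minimal identity-style sketch (my illustration, not from the question) that satisfies the requirement while leaving the filenames untouched could look like this, assuming interleave_ops is the same module imported by the benchmarks code:

ds = tf.data.Dataset.from_tensor_slices(jpeg_file_names)
ds = ds.apply(interleave_ops.parallel_interleave(
    # wrap each filename in a one-element Dataset so map_func returns a Dataset
    lambda filename: tf.data.Dataset.from_tensors(filename),
    cycle_length=10))

Note that interleaving one-element datasets like this is effectively a pass-through, so plain from_tensor_slices without parallel_interleave would yield the same elements.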

Can anyone help with this? Thanks very much.

Answer

parallel_interleave is useful when you have a transformation that turns each element of a source dataset into multiple elements of the destination dataset. I'm not sure why they use it like that in the benchmarks repo, when they could have just used a map with parallel calls.

Here's how I suggest using parallel_interleave for reading images from several directories, each containing one class:

from glob import glob

import numpy as np
import tensorflow as tf

DS = tf.data.Dataset  # shorthand used throughout this answer

classes = sorted(glob(directory + '/*/'))  # final slash selects directories only
num_classes = len(classes)

labels = np.arange(num_classes, dtype=np.int32)

dirs = DS.from_tensor_slices((classes, labels))               # 1
files = dirs.apply(tf.contrib.data.parallel_interleave(
    get_files, cycle_length=num_classes, block_length=4,      # 2
    sloppy=False))  # False is important! Otherwise it mixes labels
files = files.cache()
imgs = (files.map(read_decode, num_parallel_calls=20)         # 3
             .apply(tf.contrib.data.shuffle_and_repeat(100))
             .batch(batch_size)
             .prefetch(5))
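As a side note (this part is my addition, not from the original answer), the resulting imgs dataset could be consumed in TF 1.x graph mode roughly like this; a one-shot iterator works here because nothing in the pipeline needs explicit initialization:

iterator = imgs.make_one_shot_iterator()
img_batch, label_batch = iterator.get_next()

with tf.Session() as sess:
    # pulls one (images, labels) batch through the whole input pipeline
    first_imgs, first_labels = sess.run([img_batch, label_batch])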

There are three steps. First, we get the list of directories and their labels (#1).

Then, we map these to a dataset of files. But if we do a simple .flat_map(), we will end up with all the files of label 0, followed by all the files of label 1, then 2, etc. We would then need really large shuffle buffers to get a meaningful shuffle.
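For contrast, the label-ordered version described above would look roughly like this (my sketch, not part of the original answer):

# All files of class 0 come out first, then class 1, and so on,
# so a very large shuffle buffer would be needed afterwards.
files_sequential = dirs.flat_map(get_files)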

So, instead, we apply parallel_interleave (#2). Here is get_files():

def get_files(dir_path, label):
    globbed = tf.string_join([dir_path, '*.jpg'])
    files = tf.matching_files(globbed)

    num_files = tf.shape(files)[0] # in the directory
    labels = tf.tile([label], [num_files, ]) # expand label to all files
    return DS.from_tensor_slices((files, labels))

Using parallel_interleave ensures the file listing (get_files) of each directory runs in parallel, so by the time the first block_length files are listed from the first directory, the first block_length files from the 2nd directory will also be available (and likewise from the 3rd, 4th, etc.). Moreover, the resulting dataset will contain interleaved blocks of each label, e.g. 1 1 1 1 2 2 2 2 3 3 3 3 1 1 1 1 ... (for 3 classes and block_length=4).
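A toy illustration of that block pattern (my sketch, not from the original answer), using integer "classes" instead of files:

# Three toy classes, each repeated 8 times; with cycle_length=3 and block_length=4
# the elements come out as 0 0 0 0 1 1 1 1 2 2 2 2 0 0 0 0 1 1 1 1 2 2 2 2.
toy = tf.data.Dataset.range(3).apply(tf.contrib.data.parallel_interleave(
    lambda c: tf.data.Dataset.from_tensors(c).repeat(8),
    cycle_length=3, block_length=4, sloppy=False))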

Finally, we read the images from the list of files (#3). Here is read_decode():

def read_decode(path, label):
    img = tf.image.decode_image(tf.read_file(path), channels=3)
    # resize_bilinear expects a 4-D batch, so add a batch dimension and strip it again
    img = tf.image.resize_bilinear(tf.expand_dims(img, axis=0), target_size)
    img = tf.squeeze(img, 0)
    img = preprocess_fct(img)  # should work with Tensors!

    label = tf.one_hot(label, num_classes)
    img = tf.Print(img, [path, label], 'Read_decode')  # debug print of path and label
    return (img, label)

This function takes an image path and its label and returns a tensor for each: an image tensor for the path, and a one_hot encoding for the label. This is also the place where you can do all the transformations on the image. Here, I do resizing and basic pre-processing.
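preprocess_fct is not shown in the answer; as a hypothetical placeholder, it could be something as simple as per-image standardization, as long as it operates on Tensors:

# Hypothetical placeholder for preprocess_fct (not defined in the original answer)
def preprocess_fct(img):
    # zero-mean, unit-variance normalization applied independently to each image
    return tf.image.per_image_standardization(img)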

