Tensorflow Data API - prefetch

Problem description

I am trying to use the new features of TF, namely the Data API, and I am not sure how prefetch works. In the code below

def dataset_input_fn(...):
    dataset = tf.data.TFRecordDataset(filenames, compression_type="ZLIB")
    dataset = dataset.map(lambda x: parser(...))
    dataset = dataset.map(lambda x, y: image_augmentation(...),
                          num_parallel_calls=num_threads)

    dataset = dataset.shuffle(buffer_size)
    dataset = dataset.batch(batch_size)    
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()

does it matter where, between each of the lines above, I put dataset = dataset.prefetch(batch_size)? Or should it go after every operation that would have used output_buffer_size if the dataset were coming from tf.contrib.data?

Recommended answer

In a discussion on GitHub I found a comment by mrry:

Note that in TF 1.4 there will be a Dataset.prefetch() method that makes it easier to add prefetching at any point in the pipeline, not just after a map(). (You can try it by downloading the current nightly build.)

For example, Dataset.prefetch() will start a background thread to populate an ordered buffer that acts like a tf.FIFOQueue, so that downstream pipeline stages need not block. However, the prefetch() implementation is much simpler, because it doesn't need to support as many different concurrent operations as a tf.FIFOQueue.

So this means prefetch can be put after any command, and it works on the previous command. So far I have noticed the biggest performance gains by putting it only at the very end.
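As a minimal sketch of that recommendation (a toy pipeline standing in for the asker's parser/augmentation code, with purely illustrative sizes, not a definitive recipe):

import tensorflow as tf

batch_size = 32
num_epochs = 10

dataset = tf.data.Dataset.range(1000)        # toy stand-in for TFRecordDataset + map calls
dataset = dataset.shuffle(buffer_size=1000)
dataset = dataset.batch(batch_size)
dataset = dataset.repeat(num_epochs)
dataset = dataset.prefetch(1)                # last transformation: a background thread keeps a batch ready
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()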

There is one more discussion on Meaning of buffer_size in Dataset.map, Dataset.prefetch and Dataset.shuffle, where mrry explains a bit more about prefetch and the buffers.
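To summarise the distinction made there in a hedged sketch (the numbers are only illustrative), buffer_size means different things in different transformations:

dataset = dataset.shuffle(buffer_size=10000)  # size of the pool that shuffled elements are drawn from
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(buffer_size=1)     # how many (already batched) elements are prepared ahead of time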

Update 2018/10/01:

From version 1.7.0 the Dataset API (in contrib) has a prefetch_to_device option. Note that this transformation has to be the last one in the pipeline, and when TF 2.0 arrives contrib will be gone. To have prefetching work on multiple GPUs, please use MultiDeviceIterator (for an example see #13610) multi_device_iterator_ops.py.

https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/data/prefetch_to_device
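A minimal sketch of how prefetch_to_device is applied (the toy dataset and the device string "/gpu:0" are only examples):

import tensorflow as tf

dataset = tf.data.Dataset.range(1000)
dataset = dataset.batch(32)
# prefetch_to_device must be the final transformation in the pipeline
dataset = dataset.apply(tf.contrib.data.prefetch_to_device("/gpu:0"))
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()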
