Meaning of buffer_size in Dataset.map, Dataset.prefetch and Dataset.shuffle


Problem description

As per the TensorFlow documentation, the prefetch and map methods of the tf.contrib.data.Dataset class both have a parameter called buffer_size.

For the prefetch method, the parameter is known as buffer_size and, according to the documentation:

buffer_size: A tf.int64 scalar tf.Tensor, representing the maximum number of elements that will be buffered when prefetching.

For the map method, the parameter is known as output_buffer_size and, according to the documentation:

output_buffer_size: (Optional.) A tf.int64 scalar tf.Tensor, representing the maximum number of processed elements that will be buffered.

Similarly, for the shuffle method the same quantity appears and, according to the documentation:

buffer_size: A tf.int64 scalar tf.Tensor, representing the number of elements from this dataset from which the new dataset will sample.

What is the relation between these parameters?

Suppose I create a Dataset object as follows:

    tr_data = TFRecordDataset(trainfilenames)
    tr_data = tr_data.map(providefortraining, output_buffer_size=10 * trainbatchsize, num_parallel_calls=5)
    tr_data = tr_data.shuffle(buffer_size=100 * trainbatchsize)
    tr_data = tr_data.prefetch(buffer_size=10 * trainbatchsize)
    tr_data = tr_data.batch(trainbatchsize)

What role is being played by the buffer parameters in the above snippet?

Recommended answer

TL;DR Despite their similar names, these arguments have quite different meanings. The buffer_size in Dataset.shuffle() can affect the randomness of your dataset, and hence the order in which elements are produced. The buffer_size in Dataset.prefetch() only affects the time it takes to produce the next element.

The buffer_size argument in tf.data.Dataset.prefetch() and the output_buffer_size argument in tf.contrib.data.Dataset.map() provide a way to tune the performance of your input pipeline: both arguments tell TensorFlow to create a buffer of at most buffer_size elements, and a background thread to fill that buffer in the background. (Note that the output_buffer_size argument was removed from Dataset.map() when it moved from tf.contrib.data to tf.data. New code should use Dataset.prefetch() after map() to get the same behavior.)
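For illustration, here is a minimal sketch of how the question's pipeline might be written against the current tf.data API, where an explicit prefetch() right after map() takes the place of output_buffer_size. Note that trainfilenames, providefortraining and trainbatchsize are the asker's own placeholders, not defined here:

    import tensorflow as tf

    # Sketch only: output_buffer_size no longer exists in tf.data, so an
    # explicit prefetch() immediately after map() provides the same buffering.
    tr_data = tf.data.TFRecordDataset(trainfilenames)
    tr_data = tr_data.map(providefortraining, num_parallel_calls=5)
    tr_data = tr_data.prefetch(buffer_size=10 * trainbatchsize)  # replaces output_buffer_size
    tr_data = tr_data.shuffle(buffer_size=100 * trainbatchsize)
    tr_data = tr_data.batch(trainbatchsize)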

Adding a prefetch buffer can improve performance by overlapping the preprocessing of data with downstream computation. Typically it is most useful to add a small prefetch buffer (with perhaps just a single element) at the very end of the pipeline, but more complex pipelines can benefit from additional prefetching, especially when the time to produce a single element can vary.
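As a rough sketch (assuming the TF 2.x tf.data API and eager execution), a single-element prefetch buffer at the very end lets batch N+1 be prepared while batch N is being consumed:

    import tensorflow as tf

    dataset = (
        tf.data.Dataset.range(1000)
        .map(lambda x: x * 2, num_parallel_calls=4)  # stand-in for real preprocessing
        .batch(32)
        .prefetch(1)  # prepare one batch ahead of the consumer
    )

    for batch in dataset:
        pass  # the training step would run here, overlapped with preprocessing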

By contrast, the buffer_size argument to tf.data.Dataset.shuffle() affects the randomness of the transformation. We designed the Dataset.shuffle() transformation (like the tf.train.shuffle_batch() function that it replaces) to handle datasets that are too large to fit in memory. Instead of shuffling the entire dataset, it maintains a buffer of buffer_size elements, and randomly selects the next element from that buffer (replacing it with the next input element, if one is available). Changing the value of buffer_size affects how uniform the shuffling is: if buffer_size is greater than the number of elements in the dataset, you get a uniform shuffle; if it is 1 then you get no shuffling at all. For very large datasets, a typical "good enough" approach is to randomly shard the data into multiple files once before training, then shuffle the filenames uniformly, and then use a smaller shuffle buffer. However, the appropriate choice will depend on the exact nature of your training job.
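The effect is easy to see on a toy dataset. This sketch assumes the TF 2.x tf.data API; the shuffled outputs shown in the comments are just illustrative samples:

    import tensorflow as tf

    ds = tf.data.Dataset.range(10)

    # buffer_size=1: the buffer holds a single element, which is always the
    # one selected, so the original order is preserved (no shuffling).
    print(list(ds.shuffle(buffer_size=1).as_numpy_iterator()))
    # -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

    # buffer_size >= dataset size: all 10 elements are buffered before any is
    # drawn, so every permutation is possible (a uniform shuffle).
    print(list(ds.shuffle(buffer_size=10).as_numpy_iterator()))
    # -> e.g. [3, 0, 7, 9, 1, 5, 2, 8, 6, 4]

    # A small buffer gives only local reordering: element i cannot appear
    # earlier than position i - (buffer_size - 1) in the output.
    print(list(ds.shuffle(buffer_size=2).as_numpy_iterator()))
    # -> e.g. [1, 0, 2, 4, 3, 5, 7, 6, 8, 9]

In the sharded-files approach mentioned above, the uniform shuffle happens over the short list of filenames (which easily fits in memory), so the per-element buffer only needs to be large enough to mix nearby records.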
