What does batch, repeat, and shuffle do with TensorFlow Dataset?


Question

I'm currently learning TensorFlow, but I came across some confusion in this code:

dataset = dataset.shuffle(buffer_size=10 * batch_size)
dataset = dataset.repeat(num_epochs).batch(batch_size)
return dataset.make_one_shot_iterator().get_next()

I know the dataset holds all the data, but what do shuffle(), repeat(), and batch() do to it? Please give me an explanation with an example.

Answer

Imagine you have a dataset: [1, 2, 3, 4, 5, 6], then:

How ds.shuffle() works

dataset.shuffle(buffer_size=3) allocates a buffer of size 3 for picking random entries. This buffer is connected to the source dataset. We can picture it like this:

Random buffer
   |
   |   Source dataset where all other elements live
   |         |
   ↓         ↓
[1,2,3] <= [4,5,6]

Let's assume that the entry 2 was taken from the random buffer. The free slot is filled by the next element from the source dataset, which is 4:

2 <= [1,3,4] <= [5,6]

We continue reading until nothing is left:

1 <= [3,4,5] <= [6]
5 <= [3,4,6] <= []
3 <= [4,6]   <= []
6 <= [4]     <= []
4 <= []      <= []
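The buffered draw-and-refill process above can be sketched in plain Python. This is a simplified model of the mechanism, not TensorFlow's actual implementation, and `buffered_shuffle` is a made-up helper name, not a TensorFlow API:

```python
import random

def buffered_shuffle(source, buffer_size, seed=None):
    """Simulate tf.data's buffered shuffle: keep a buffer of
    `buffer_size` elements, emit a random one at a time, and
    refill the freed slot from the source until it runs dry."""
    rng = random.Random(seed)
    it = iter(source)
    buffer = []
    # Initial fill of the random buffer.
    for _ in range(buffer_size):
        try:
            buffer.append(next(it))
        except StopIteration:
            break
    while buffer:
        # Pick a random entry from the buffer...
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        # ...and refill its slot from the source, if anything is left.
        try:
            buffer[idx] = next(it)
        except StopIteration:
            buffer.pop(idx)

print(list(buffered_shuffle([1, 2, 3, 4, 5, 6], buffer_size=3, seed=0)))
```

Note that the first element emitted can only be 1, 2, or 3, because only those are in the buffer at the start; 6 cannot appear before the fourth position.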

How ds.repeat() works

As soon as all the entries have been read from the dataset and you try to read the next element, the dataset throws an out-of-range error. That's where ds.repeat() comes into play. It re-initializes the dataset, making it look like this again:

[1,2,3] <= [4,5,6]
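A minimal pure-Python sketch of this re-initialization behavior (illustrative only; `repeat_dataset` is a hypothetical name, not the TensorFlow API):

```python
def repeat_dataset(source, count):
    """Simulate ds.repeat(count): instead of raising an
    end-of-sequence error, replay the dataset `count` times."""
    for _ in range(count):
        yield from source

print(list(repeat_dataset([1, 2, 3, 4, 5, 6], count=2)))
# -> [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
```

In the real pipeline the repeat sits after the shuffle, so each pass through the data comes out in a different order.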

What ds.batch() does

ds.batch() takes the first batch_size entries and makes a batch out of them. So a batch size of 3 for our example dataset produces two batch records:

[2,1,5]
[3,6,4]

Because we have ds.repeat() before the batch, the generation of data continues. But the order of the elements will differ due to ds.shuffle(). What should be taken into account is that 6 will never appear in the first batch, because of the size of the shuffle buffer.
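Batching itself is easy to model in plain Python (`batch_dataset` is a hypothetical helper, not the TensorFlow API; it groups elements in source order, whereas the records above came out of the shuffle first):

```python
def batch_dataset(source, batch_size):
    """Simulate ds.batch(): group consecutive elements into
    lists of `batch_size` (the last batch may be shorter)."""
    batch = []
    for item in source:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # leftover partial batch
        yield batch

print(list(batch_dataset([1, 2, 3, 4, 5, 6], batch_size=3)))
# -> [[1, 2, 3], [4, 5, 6]]
```

The real ds.batch() also accepts drop_remainder=True to discard that final partial batch.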

