Tensorflow dataset questions about .shuffle, .batch and .repeat


Problem description

I had a question about the use of batch, repeat and shuffle with tf.Dataset.

It is not clear to me exactly how repeat and shuffle are used. I understand that .batch dictates how many training examples will undergo stochastic gradient descent per step, but the uses of .repeat and .shuffle are still not clear to me.

First question

Even after reviewing here and here, my understanding is that .repeat is used to reiterate over the dataset once a tf.errors.OutOfRangeError is thrown. Therefore, in my code does that mean I no longer have to implement:

try:
    while True:
        _ = sess.run(self.optimizer)
except tf.errors.OutOfRangeError:
    pass

because .repeat will automatically repeat the dataset once it is exhausted? When does it stop? Or will it never stop, and you just have to exit the while True loop once a certain number of batches (say 1000) have passed?
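To make the two stopping behaviours concrete, here is a plain-Python sketch (no TensorFlow required; the `repeated` helper is illustrative, not a tf.data API): a finite repeat(count) ends on its own, while an argument-less repeat never ends, so the training loop must cap the step count itself.

```python
def repeated(dataset, count=None):
    """Illustrative stand-in for tf.data's repeat(): yield the
    dataset `count` times, or forever when count is None."""
    epoch = 0
    while count is None or epoch < count:
        yield from dataset
        epoch += 1

dataset = [1, 2, 3]

# Finite repeat: iteration ends by itself after 10 passes, which is
# the point where a real tf.data iterator raises OutOfRangeError.
finite = list(repeated(dataset, count=10))
assert len(finite) == 30

# Infinite repeat: the stream never ends, so the loop must break
# after a fixed number of steps (say 1000).
steps = 0
for _ in repeated(dataset):
    steps += 1
    if steps >= 1000:
        break
assert steps == 1000
```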

Second question

Secondly, the use of .shuffle makes no sense to me. Does .shuffle().batch() imply that I have, say, 100,000 samples, put 1,000 of them randomly in a buffer with .shuffle, then batch, say, 100 of them with .batch()? From my understanding, the next batch will use 999 of those samples and place 1 new one in the buffer. So if my samples have no order to them, should .shuffle be avoided altogether? And if .batch is used, would it still batch 100 from the 999+1 in the buffer?

Third question

And lastly, if I am using a separate tf.Dataset object for testing, what order of .shuffle() and .batch() should I consider? Right now I use:

sess.run(self.test_init)
try:
    while True:
        accuracy_batch = sess.run(self.accuracy)

except tf.errors.OutOfRangeError:
    pass

with:

test_data = self.test_dataset.shuffle(self.batch_size).batch(self.batch_size)

I have over 110,000 training examples at my disposal, so self.batch_size will set the number of samples I want to use to test my accuracy. So, if I wanted to just test on the whole test dataset, would I not use .batch? But since I have it iterating over the whole dataset with while True, does it make no difference? With the use of .shuffle I noticed my accuracies changed, but without it they were very similar. This makes me think .shuffle is randomizing the batches and may be reusing training examples?

Answer

First question:

That is correct - if you supply the dataset this way, you no longer need to catch the OutOfRangeError.

repeat() takes an optional argument for the number of times it should repeat. This means repeat(10) will iterate over the entire dataset 10 times. If you choose to omit the argument, it will repeat indefinitely.

shuffle() (if used) should be called before batch() - we want to shuffle records, not batches.
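A tiny plain-Python demonstration of why the order matters (list slicing stands in for tf.data here; the names are illustrative): shuffling after batching only reorders the batches, and each batch still holds consecutive records.

```python
import random

records = list(range(10))
rng = random.Random(0)

# shuffle records, then batch: a batch can mix records
# from anywhere in the dataset
shuffled = records[:]
rng.shuffle(shuffled)
record_batches = [shuffled[i:i + 2] for i in range(0, 10, 2)]

# batch first, then shuffle: only the order of the batches changes;
# every batch still contains consecutive records
batches = [records[i:i + 2] for i in range(0, 10, 2)]
rng.shuffle(batches)
assert all(b[1] == b[0] + 1 for b in batches)
```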

The buffer is first filled by adding your records in order; then, once full, a random one is selected and emitted, and a new record is read from the original source.

If you have something like

ds.shuffle(1000).batch(100)

then, in order to return a single batch, this last step is repeated 100 times (maintaining the buffer at 1000). Batching is a separate operation.
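The buffer-then-batch mechanics described above can be sketched in plain Python (a simplified model of tf.data's behaviour, not the real implementation):

```python
import random

def shuffle_stream(source, buffer_size, rng):
    """Keep a buffer of up to `buffer_size` records; emit a random
    one, then refill that slot from the source until it runs dry."""
    it = iter(source)
    buf = []
    for record in it:               # initial fill, in order
        buf.append(record)
        if len(buf) >= buffer_size:
            break
    while buf:
        i = rng.randrange(len(buf))
        yield buf[i]                # emit a random buffered record
        try:
            buf[i] = next(it)       # replace it with a fresh record
        except StopIteration:
            buf.pop(i)              # source exhausted: drain buffer

def batched(stream, batch_size):
    """Batching is a separate step applied to the shuffled stream."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

rng = random.Random(0)
batches = list(batched(shuffle_stream(range(10_000), 1000, rng), 100))
assert len(batches) == 100
# every record is emitted exactly once - shuffling reorders,
# it does not resample
assert sorted(x for b in batches for x in b) == list(range(10_000))
```

This also answers the "999+1" worry from the question: each emitted record is replaced by exactly one new record, so nothing is reused within an epoch.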

Generally we don't shuffle a test set at all - only the training set (we evaluate using the entire test set anyway, right? So why shuffle?).

So, if I wanted to just test on the whole test dataset I wouldn't use .batch

Hmm - not so (at least not always). You would certainly need to use batch if your whole test dataset didn't fit into memory - a common occurrence. You would want to test the whole dataset, but to run the numbers in manageable bites!
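One practical detail when evaluating in batches: accumulate correct/total counts across batches rather than averaging per-batch accuracies, because the final batch may be smaller than the rest. A minimal sketch (the function name and data are illustrative):

```python
def batched_accuracy(predictions, labels, batch_size):
    """Accumulate correct/total over batches so the (possibly
    smaller) final batch is weighted correctly."""
    correct = total = 0
    for start in range(0, len(labels), batch_size):
        p = predictions[start:start + batch_size]
        y = labels[start:start + batch_size]
        correct += sum(int(a == b) for a, b in zip(p, y))
        total += len(y)
    return correct / total

preds  = [0, 1, 1, 0, 1, 1, 0]
labels = [0, 1, 0, 0, 1, 1, 1]
# 5 of 7 predictions match, including the short final batch
assert batched_accuracy(preds, labels, batch_size=3) == 5 / 7
```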
