How to cache and iterate through a Dataset of unknown size?

Problem description

Despite adding the .cache() step to my dataset pipeline, successive training epochs still download the data from the network storage.

I have a dataset on network storage. I want to cache it, but not repeat it: a training epoch must run through the whole dataset. Here is my dataset-building pipeline:

return tf.data.Dataset.list_files(
        file_pattern                        # TFRecord files on the network storage
    ).interleave(
        tf.data.TFRecordDataset,            # read the listed files in parallel
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    ).shuffle(
        buffer_size=2048
    ).batch(
        batch_size=2048,
        drop_remainder=True,
    ).cache(                                # cache the batched records after the first full pass
    ).map(
        map_func=_parse_example_batch,      # parse the serialized Example batches
        num_parallel_calls=tf.data.experimental.AUTOTUNE
    ).prefetch(
        buffer_size=32
    )

If I use it as is, the dataset is downloaded at every epoch. To avoid this, I would have to add a .repeat() step to the pipeline and use the steps_per_epoch keyword of the model.fit function. However, I do not know the size of the complete dataset, so I cannot pass the right steps_per_epoch value.
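
For illustration, here is a minimal sketch of that workaround (build_pipeline stands for a hypothetical function wrapping the pipeline above, and batches_per_epoch is the value I would need but do not know):

dataset = build_pipeline(file_pattern).repeat()  # repeat forever; epochs are delimited by steps_per_epoch

model.fit(
    dataset,
    epochs=10,
    steps_per_epoch=batches_per_epoch,  # number of batches in one full pass -- unknown here
)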

What is the right way to cache and use a dataset of unknown size?

Thanks.

While reading some TF code, I (re)discovered make_initializable_iterator. It seems to be what I am looking for, that is to say, a way to iterate multiple times through the same dataset (taking advantage of the cache after the first iteration). However, it is deprecated and no longer part of the main API in TF2.
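
For context, this is roughly the TF1-style pattern I had in mind (a rough sketch only; it needs graph mode, and the API now lives under tf.compat.v1; `dataset` stands for the cached pipeline built above):

import tensorflow as tf

tf.compat.v1.disable_eager_execution()

iterator = tf.compat.v1.data.make_initializable_iterator(dataset)
next_batch = iterator.get_next()

with tf.compat.v1.Session() as sess:
    for epoch in range(3):
        sess.run(iterator.initializer)        # restart the (cached) dataset for a new epoch
        while True:
            try:
                batch = sess.run(next_batch)  # ...run one training step on `batch` here...
            except tf.errors.OutOfRangeError:
                break                         # one full pass through the dataset is done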

The recommended replacement is to iterate over the Dataset manually with for ... in dataset. Isn't that exactly what the keras.Model.fit function does? Do I have to write the training loop manually to get the benefit of the cache?
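
Concretely, does that mean writing something like the following by hand? (A minimal sketch; model, dataset, and the loss/optimizer choices stand in for the objects of my real setup.)

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()         # stand-in optimizer
loss_fn = tf.keras.losses.MeanAbsoluteError()  # stand-in loss

for epoch in range(3):
    for x, y in dataset:                       # full pass; the cache serves every epoch after the first
        with tf.GradientTape() as tape:
            y_pred = model(x, training=True)
            loss = loss_fn(y, y_pred)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))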

Kind regards.

Answer

Good news! The final v2.0.0 release fixes this behavior.

Here is a code snippet to highlight the different behaviors.

import time

import tensorflow as tf
import tensorflow.keras as keras

# Simple layer that just prints its inputs
class Print(keras.layers.Layer):

    def compute_output_signature(self, input_signature):
        return input_signature

    def call(self, inputs, **kwargs):
        tf.print(inputs)
        return inputs

# Generator returning incremented values each time it is re-initialized
generator_list = [0]
def generator():
    v = generator_list[-1]
    generator_list.append(v + 1)
    tf.print("Generating samples with value {}".format(v))
    time.sleep(2)
    for i in range(2):
        yield (tf.constant([v]), tf.constant(v))


def main():
    model_input = keras.layers.Input(shape=(1,))
    model_output = Print()(model_input)
    model = keras.Model(inputs=model_input, outputs=model_output)
    model.compile("adam", loss="mae")

    ds = tf.data.Dataset.from_generator(
        generator, (tf.int64, tf.int64), ([1], [])
    )
    cached_ds = ds.cache()

    tf.print("Fit")
    model.fit(
        cached_ds,
        epochs=3,
        verbose=2
    )

    tf.print("For ... in ...")
    for i in range(3):
        for x, y in cached_ds:
            model(x)

if __name__ == '__main__':
    main()

With tensorflow 2.0.0-b1 (used on Google AI Platform), here is the output:

Fit
Epoch 1/3
Generating samples with value 0
# sleep 2s
2019-10-03 15:45:32.718522: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1483] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
[[0]]
[[0]]
2/2 - 2s - loss: 0.0000e+00
Generating samples with value 1
# sleep 2s
Epoch 2/3
[[1]]
[[1]]
2/2 - 2s - loss: 0.0000e+00
Epoch 3/3
2019-10-03 15:45:34.774195: W tensorflow/core/kernels/data/cache_dataset_ops.cc:815] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
Generating samples with value 2
# sleep 2s
[[2]]
[[2]]
2019-10-03 15:45:36.782046: W tensorflow/core/kernels/data/cache_dataset_ops.cc:815] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2/2 - 2s - loss: 0.0000e+00
For ... in ...
Generating samples with value 3
# sleep 2s
[3]
[3]
Generating samples with value 4
# sleep 2s
[4]
[4]
Generating samples with value 5
# sleep 2s
[5]
[5]

You can see that the value of the tensor is incremented at each epoch, and that the sleep instruction is executed each time. Moreover, we get the warning about the truncated iterator...

Now, with tensorflow 2.0.0:

Fit
Epoch 1/3
WARNING:tensorflow:The list of trainable weights is empty. Make sure that you are not setting model.trainable to False before compiling the model.
Generating samples with value 0
# sleep 2s
[[0]]
[[0]]
2019-10-03 15:49:59.587796: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
     [[{{node IteratorGetNext}}]]
2/2 - 2s - loss: 0.0000e+00
Epoch 2/3
[[0]]
[[0]]
2019-10-03 15:49:59.598144: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
     [[{{node IteratorGetNext}}]]
2/2 - 0s - loss: 0.0000e+00
Epoch 3/3
[[0]]
[[0]]
2019-10-03 15:49:59.605260: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
     [[{{node IteratorGetNext}}]]
For ... in ...
2/2 - 0s - loss: 0.0000e+00
[0]
[0]
[0]
[0]
[0]
[0]

And voilà! The generator function is executed only once, there are no more sleeps, and the value of the tensor is always the same. I still get some warnings about the end of the sequence, but I can live with that!

Kind regards.
