Tensorflow dataset data preprocessing is done once for the whole dataset or for each call to iterator.next()?


Problem description

Hi, I am studying the Dataset API in TensorFlow and I have a question about the dataset.map() function, which performs data preprocessing.

file_names = ["image1.jpg", "image2.jpg", ......]
im_dataset = tf.data.Dataset.from_tensor_slices(file_names)
# image_parser maps one file name to 3 float32 tensors describing the image
im_dataset = im_dataset.map(lambda image: tuple(tf.py_func(image_parser, [image], [tf.float32, tf.float32, tf.float32])))
im_dataset = im_dataset.batch(batch_size)
iterator = im_dataset.make_initializable_iterator()

The dataset takes in image names and parses them into 3 tensors (3 pieces of information about each image).

If I have a very large number of images in my training folder, preprocessing them is going to take a long time. My question is: since the Dataset API is said to be designed for efficient input pipelines, is the preprocessing done for the whole dataset before I feed it to my workers (let's say GPUs), or does it only preprocess one batch of images each time I call iterator.get_next()?

Solution

In short: the map transformation is applied lazily, so elements are preprocessed on demand as the iterator requests them, not for the whole dataset ahead of time. If your preprocessing pipeline is very long and its output is small, the processed data should fit in memory. If that is the case, you can use tf.data.Dataset.cache to cache the processed data in memory or in a file, so that each element is only preprocessed once.
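
A quick way to see that laziness, as a minimal sketch in the same TF 1.x graph-mode style as the question (slow_parse is a made-up stand-in for your image_parser), is to put a visible side effect inside the map function: it fires once per element, per get_next() call, not once for the whole dataset up front.

import tensorflow as tf

def slow_parse(x):
    # Stand-in for an expensive parser; the print shows when it actually runs.
    print("parsing element", x)
    return x

dataset = tf.data.Dataset.range(3)
dataset = dataset.map(lambda x: tf.py_func(slow_parse, [x], tf.int64))

res = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    # No "parsing element" message has appeared yet: building the pipeline
    # does not preprocess anything. Each sess.run(res) parses exactly one element.
    print(sess.run(res))
    print(sess.run(res))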

From the official performance guide:

The tf.data.Dataset.cache transformation can cache a dataset, either in memory or on local storage. If the user-defined function passed into the map transformation is expensive, apply the cache transformation after the map transformation as long as the resulting dataset can still fit into memory or local storage. If the user-defined function increases the space required to store the dataset beyond the cache capacity, consider pre-processing your data before your training job to reduce resource usage.


Example use of cache in memory

Here is an example where each pre-processing step takes a lot of time (0.5s). The second epoch on the dataset will be much faster than the first.

import time
import tensorflow as tf

def my_fn(x):
    time.sleep(0.5)
    return x

def parse_fn(x):
    return tf.py_func(my_fn, [x], tf.int64)

dataset = tf.data.Dataset.range(5)
dataset = dataset.map(parse_fn)
dataset = dataset.cache()    # cache the processed dataset, so every input will be processed once
dataset = dataset.repeat(2)  # repeat for multiple epochs

res = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    for i in range(10):
        # First 5 iterations will take 0.5s each, last 5 will not
        print(sess.run(res))


Caching to a file

If you want to write the cached data to a file, you can provide an argument to cache():

dataset = dataset.cache('/tmp/cache')  # will write cached data to a file

This will allow you to process the dataset only once, and run multiple experiments on the data without reprocessing it.
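
As a rough sketch of that reuse (same assumptions as above: parse_fn and my_fn defined earlier, /tmp/cache as the prefix), a later run that rebuilds the same pipeline reads elements straight from the cache files, so the 0.5s sleep in my_fn is not paid again:

# In a later run / separate experiment:
import tensorflow as tf

dataset = tf.data.Dataset.range(5)
dataset = dataset.map(parse_fn)        # parse_fn from the example above
dataset = dataset.cache('/tmp/cache')  # complete cache files already exist, so they are read instead of recomputed

res = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    for i in range(5):
        print(sess.run(res))           # fast: my_fn is not called again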

Warning: You have to be careful when caching to a file. If you change your data, but keep the /tmp/cache.* files, it will still read the old data that was cached. For instance, if we use the data from above and change the range of the data to be in [10, 15], we will still obtain data in [0, 5]:

dataset = tf.data.Dataset.range(10, 15)
dataset = dataset.map(parse_fn)
dataset = dataset.cache('/tmp/cache')
dataset = dataset.repeat(2)  # repeat for multiple epochs

res = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    for i in range(10):
        print(sess.run(res))  # will still be in [0, 5]...

Always delete the cached files whenever the data that you want to cache changes.
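
As a rough cleanup sketch (assuming the /tmp/cache prefix used above), remove every file the cache produced before re-running on new data:

import glob
import os

# Remove the /tmp/cache.* files mentioned above (including a leftover
# lockfile, if any) so the pipeline recomputes from the new data.
for path in glob.glob('/tmp/cache.*'):
    os.remove(path)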

Another issue that may arise is if you interrupt the script before all the data is cached. You will receive an error like this:

AlreadyExistsError (see above for traceback): There appears to be a concurrent caching iterator running - cache lockfile already exists ('/tmp/cache.lockfile'). If you are sure no other running TF computations are using this cache prefix, delete the lockfile and re-initialize the iterator.

Make sure you let the whole dataset be processed once, so that you end up with a complete cache file.
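
A rough sketch of such a warm-up pass, under the same assumptions as the snippets above (parse_fn defined earlier, /tmp/cache as the prefix): iterate the dataset once until it is exhausted, so the cache file is written completely before training starts.

import tensorflow as tf

dataset = tf.data.Dataset.range(10, 15)
dataset = dataset.map(parse_fn)        # parse_fn from the example above
dataset = dataset.cache('/tmp/cache')

res = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    # Consume every element once; OutOfRangeError signals the end of the
    # dataset, at which point the cache file is complete.
    try:
        while True:
            sess.run(res)
    except tf.errors.OutOfRangeError:
        pass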
