parallelising tf.data.Dataset.from_generator


Problem description


I have a non-trivial input pipeline that from_generator is perfect for...

import tensorflow as tf

dataset = tf.data.Dataset.from_generator(complex_img_label_generator,
                                         (tf.int32, tf.string))
dataset = dataset.batch(64)
iter = dataset.make_one_shot_iterator()
imgs, labels = iter.get_next()

Here complex_img_label_generator dynamically generates images and returns a numpy array representing an (H, W, 3) image together with a simple string label. The processing is not something I can represent as reading from files and tf.image operations.
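For concreteness, a generator with that contract might look roughly like the following (purely illustrative, with made-up shapes and labels; the real generator does expensive PIL/numpy work):

import numpy as np

def complex_img_label_generator():
  # illustrative stand-in: yields an (H, W, 3) int32 array and a string label,
  # matching the (tf.int32, tf.string) output_types declared above
  while True:
    img = np.random.randint(0, 255, size=(128, 128, 3), dtype=np.int32)
    label = b"some_label"
    yield img, label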

My question is: how do I parallelise the generator? How do I have N of these generators running in their own threads?

One thought was to use dataset.map with num_parallel_calls to handle the threading; but map operates on tensors... Another thought was to create multiple generators, each with its own prefetch, and somehow join them, but I can't see how I'd join N generator streams.

Any canonical examples I could follow?

Solution

Turns out I can use Dataset.map if I make the generator super lightweight (only generating metadata) and then move the actual heavy lifting into a stateless function. This way I can parallelise just the heavy-lifting part with .map using a py_func.

It works, but feels a tad clumsy... It would be great to be able to just add num_parallel_calls to from_generator :)
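For illustration, the lightweight generator used below might be as simple as this (hypothetical example; it only emits cheap metadata strings, e.g. a file id, plus the label, and does none of the heavy work):

def lightweight_generator():
  # hypothetical: yield only metadata, matching output_types=(tf.string, tf.string)
  for metadata, label in [(b"img_000", b"cat"), (b"img_001", b"dog")]:
    yield metadata, label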

import tensorflow as tf

def pure_numpy_and_pil_complex_calculation(metadata, label):
  # some complex PIL and numpy work, nothing to do with tf
  ...

dataset = tf.data.Dataset.from_generator(lightweight_generator,
                                         output_types=(tf.string,   # metadata
                                                       tf.string))  # label

def wrapped_complex_calculation(metadata, label):
  return tf.py_func(func=pure_numpy_and_pil_complex_calculation,
                    inp=(metadata, label),
                    Tout=(tf.uint8,    # (H,W,3) img
                          tf.string))  # label

dataset = dataset.map(wrapped_complex_calculation,
                      num_parallel_calls=8)

dataset = dataset.batch(64)
iter = dataset.make_one_shot_iterator()
imgs, labels = iter.get_next()
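As for the other idea in the question (running N generators and joining their streams), Dataset.interleave can merge per-shard generator datasets. Below is a minimal sketch, assuming a TF version where from_generator accepts an args argument and interleave accepts num_parallel_calls (older 1.x releases offered tf.contrib.data.parallel_interleave instead); shard_generator and NUM_SHARDS are hypothetical:

import tensorflow as tf

NUM_SHARDS = 8  # hypothetical number of parallel generator streams

def shard_generator(shard_index):
  # hypothetical: each shard yields only its own slice of (img, label) examples
  ...

def make_shard_dataset(shard_index):
  # `args` forwards the shard index tensor to the Python generator
  return tf.data.Dataset.from_generator(shard_generator,
                                        output_types=(tf.int32, tf.string),
                                        args=(shard_index,))

dataset = tf.data.Dataset.range(NUM_SHARDS).interleave(make_shard_dataset,
                                                       cycle_length=NUM_SHARDS,
                                                       num_parallel_calls=NUM_SHARDS)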
