What are the scenarios for which the various TensorFlow data loading idioms apply?

Problem description

I have a TensorFlow deep learning workflow in which I have a fairly simple data reading and feeding pipeline built using regular NumPy; but I see that TensorFlow offers a large number of functions for loading data and building a data pipeline. I wonder, though, what scenarios these target. It seems there are two:

  1. learning that involves very large real-world datasets, and
  2. networks built with the high-level TensorFlow API.

It seems that the benefits of using "reading" as opposed to "feeding" (e.g. functions such as tf.train.shuffle_batch, but even simple helpers like tf.one_hot) apply to the former, while much of the documentation for things like input functions seems targeted exclusively at the latter.
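
For instance, a minimal tf.one_hot call looks like this (the labels and depth below are made up purely for illustration):

import tensorflow as tf

labels = tf.constant([0, 2, 1])       # example class indices
onehot = tf.one_hot(labels, depth=3)  # -> a [3, 3] one-hot matrix

with tf.Session() as sess:
    print(sess.run(onehot))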

In my case, where I'm working with files that can easily be read using standard Python and can efficiently be loaded into memory all at once, feeding simply using np.arrays works fine, and the other approaches either seem far too complex (reading pipelines, which are actually quite slow, to the extent I've tried them) or inappropriate (the "high-level" API, given that I mostly use the "low-level" API).

Am I correct in concluding that, given that I already use TensorFlow proper (not the high-level learning API) and that feeding NumPy arrays meets my needs, there's no good reason to bother with either of the alternative approaches? Is it fair to say that the intended targets of these approaches are different from mine?

Or is there another taxonomy that better factors the various TensorFlow data loading idioms and the scenarios to which they apply?

Recommended answer

Yaroslav has already told you about feeding and queues, and touched upon Datasets. Just a few of my own thoughts:

  • If you just want to learn TF or want to quickly experiment with your various models, feed_dict provides a quick way to do this (a minimal sketch follows this list). There is a performance downside, and this is why there are queues.
  • Queues allow you to specify TF ops that bypass the python -> native_TF -> python loop and the GIL. The big problem with queues is that they are hard to use (I always struggled a lot before being able to use my data correctly). Many other people struggled as well, and you can see some examples of the problems here.
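
Here is a minimal feed_dict sketch of the kind of quick experiment meant above; the placeholder shape and the toy reduce_mean graph are made-up illustrations, not part of the original answer:

import numpy as np
import tensorflow as tf

# Hypothetical toy graph: a float placeholder fed from an in-memory NumPy array.
x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.reduce_mean(x)

with tf.Session() as sess:
    batch = np.random.rand(32, 4).astype(np.float32)  # one in-memory batch
    print(sess.run(y, feed_dict={x: batch}))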

The newly introduced Datasets (for some reason there is no link from the official website; one will probably be added with TF 1.3) solve many of these problems. They are very easy to use (check the examples at the end of the page) and the code is very simple and short. Here is an example:

import glob
import tensorflow as tf

batch_size, shuffle_num, repeat_num = 32, 10000, 1  # illustrative values

def parser(record):
    # Parse one serialized tf.train.Example; this feature spec is a
    # placeholder and must match how the shards were actually written.
    parsed = tf.parse_single_example(record, features={
        "image": tf.FixedLenFeature([], tf.string),
        "label": tf.FixedLenFeature([], tf.int64),
    })
    return parsed["image"], parsed["label"]

# Shuffle before batching, so individual records (not whole batches) get shuffled.
dataset = tf.contrib.data.TFRecordDataset(
    glob.glob("data/tfrecords/training_shard_*.tfrecord")
).map(parser).shuffle(shuffle_num).batch(batch_size).repeat(repeat_num)

iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# ... build the model on top of next_element ...
with tf.Session() as sess:
    sess.run(iterator.initializer)
    for i in range(100000):
        sess.run(next_element)

These few lines replace roughly four times as many lines of queue-based code. Getting it to work is also easier than with queues (almost as easy as feed_dict). So my opinion now is that there is no place for queues any more: use either feed_dict or Datasets.
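
For comparison, here is a sketch of what a roughly equivalent queue-based pipeline looks like; the feature spec and the capacity numbers are illustrative assumptions mirroring the Dataset example above:

import glob
import tensorflow as tf

# Queue-based equivalent (illustrative): filename queue -> reader -> shuffle_batch.
filename_queue = tf.train.string_input_producer(
    glob.glob("data/tfrecords/training_shard_*.tfrecord"))
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(serialized, features={
    "image": tf.FixedLenFeature([], tf.string),  # assumed feature spec
    "label": tf.FixedLenFeature([], tf.int64),
})
image_batch, label_batch = tf.train.shuffle_batch(
    [features["image"], features["label"]],
    batch_size=32, capacity=10000, min_after_dequeue=1000)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        sess.run([image_batch, label_batch])  # one training step's worth of data
    finally:
        coord.request_stop()
        coord.join(threads)

Note the extra moving parts (coordinator, queue-runner threads, explicit shutdown) that the Dataset version does away with.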
