What are the scenarios for which the various TensorFlow data loading idioms apply?


Problem description

I have a TensorFlow deep learning workflow in which I have a fairly simple data reading and feeding pipeline built using regular NumPy; but I see that TensorFlow offers a large number of functions for loading data and building a data pipeline. I wonder, though, what scenarios these target. It seems there are two:

  1. learning involving very large real-world data sets, and
  2. networks built with the high-level TensorFlow APIs.

It seems that the benefits of using "reading" as opposed to "feeding" (e.g. functions such as tf.train.shuffle_batch, but even simple helpers like tf.one_hot) apply to the former, while much of the documentation for things like input functions seems targeted exclusively at the latter.
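For concreteness, here is a minimal sketch (not from the original question) of what such an in-graph helper looks like, assuming the TF 1.x API; the label values and the depth of 10 are made up for illustration:

import numpy as np
import tensorflow as tf

# Integer class ids are fed in, and the one-hot encoding happens inside the
# graph via tf.one_hot instead of being precomputed in NumPy beforehand.
labels = tf.placeholder(tf.int32, shape=[None])
one_hot_labels = tf.one_hot(labels, depth=10)

with tf.Session() as sess:
    print(sess.run(one_hot_labels, feed_dict={labels: np.array([1, 3, 7])}))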

In my case, where I'm working with files that can easily be read using standard Python and can efficiently be loaded into memory all at once, feeding simply using np.arrays works fine, and the other approaches either seem far too complex (reading pipelines, which are actually quite slow to the extent I've tried them) or inappropriate (the "high-level" API, given that I mostly use the "low-level" API).
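For reference, a minimal sketch of this feeding style, assuming the TF 1.x API; the toy model, shapes, and random arrays below are placeholders for illustration, not the asker's actual pipeline:

import numpy as np
import tensorflow as tf

# Hypothetical data set, small enough to live entirely in memory as NumPy arrays.
x_train = np.random.rand(1000, 4).astype(np.float32)
y_train = np.random.randint(0, 2, size=1000)

x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.placeholder(tf.int32, shape=[None])
logits = tf.layers.dense(x, 2)  # toy model, just to have something to train
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(10):
        # Each step feeds the whole in-memory arrays through feed_dict.
        sess.run(train_op, feed_dict={x: x_train, y: y_train})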

Am I correct in concluding that, given that I already use TensorFlow proper (not the low level learning API) and that feeding NumPy arrays meets my needs, there's no good reason to bother with either of the alternative approaches? Is it fair to say that the intended targets for these approaches are different from mine?

Or is there another taxonomy that better factors the various TensorFlow data loading idioms, and the scenarios to which they apply?

Recommended answer

Yaroslav already told you about feeding, queues, and touched upon Datasets. Just a few of my own thoughts:

  • if you just want to learn TF or want to quickly experiment with your various models, feed_dict gives you a quick way to do this. There is a performance downside, and this is why there are queues
  • queues allow you to specify TF ops which bypass the python -> native_TF -> python loop and the GIL. The big problem with queues is that they are hard to use (I always struggled a lot before being able to use my data correctly). Many other people struggled too, and you can see some examples of the problems here (a rough sketch of a queue-based pipeline follows this list)
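As a point of comparison (not from the original answer), here is a minimal sketch of the kind of queue-based reading pipeline the bullet above refers to, assuming the TF 1.x queue API; the file pattern and the feature names/shapes are hypothetical, chosen only for illustration:

import tensorflow as tf

# Filename queue over a set of TFRecord shards (pattern is a placeholder).
filename_queue = tf.train.string_input_producer(
    tf.train.match_filenames_once("data/tfrecords/training_shard_*.tfrecord"))
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)

# Feature names/shapes stand in for whatever the records actually contain.
features = tf.parse_single_example(
    serialized,
    features={"image": tf.FixedLenFeature([784], tf.float32),
              "label": tf.FixedLenFeature([], tf.int64)})
image_batch, label_batch = tf.train.shuffle_batch(
    [features["image"], features["label"]],
    batch_size=32, capacity=2000, min_after_dequeue=1000)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(),
              tf.local_variables_initializer()])
    # Queue runners must be started explicitly, and shut down via a Coordinator.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        images, labels = sess.run([image_batch, label_batch])
    finally:
        coord.request_stop()
        coord.join(threads)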

The newly introduced Datasets (for some reason there is no link from the official website; it will probably be added with TF 1.3) solve many of these problems. They are very easy to use (check the examples at the end of the page) and the code is very simple and short. Here is an example:

import glob
import tensorflow as tf

batch_size, shuffle_num, repeat_num = 32, 1000, 1  # example values

def parser(record):
    # parse the record with tf.parse_single_example
    # (the feature names/shapes below are illustrative placeholders)
    features = tf.parse_single_example(
        record,
        features={"image": tf.FixedLenFeature([784], tf.float32),
                  "label": tf.FixedLenFeature([], tf.int64)})
    return features["image"], features["label"]

iterator = tf.contrib.data.TFRecordDataset(
    glob.glob("data/tfrecords/training_shard_*.tfrecord")
).map(parser).batch(batch_size).shuffle(shuffle_num).repeat(repeat_num).make_initializable_iterator()
next_element = iterator.get_next()

...
with tf.Session() as sess:
    sess.run(iterator.initializer)

    for i in range(100000):
        sess.run(next_element)

These few lines were able to replace roughly four times as many lines of queue-based code. Also, making it work is easier than with queues (almost as easy as feed_dict). So now my opinion is that there is no place for queues any more: either use feed_dict or Datasets.
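Not part of the original answer, but worth noting for the in-memory case described in the question: the same Dataset API can also consume NumPy arrays directly via from_tensor_slices. The sketch below assumes the TF 1.2-era tf.contrib.data module; the array shapes and parameters are made up:

import numpy as np
import tensorflow as tf

# Hypothetical in-memory arrays standing in for data read with plain Python.
x_train = np.random.rand(1000, 4).astype(np.float32)
y_train = np.random.randint(0, 2, size=1000).astype(np.int64)

# Build a Dataset straight from the NumPy arrays, then shuffle and batch it.
dataset = (tf.contrib.data.Dataset.from_tensor_slices((x_train, y_train))
           .shuffle(buffer_size=1000)
           .batch(32)
           .repeat())
iterator = dataset.make_initializable_iterator()
features, labels = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    batch_x, batch_y = sess.run([features, labels])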
