Using Queues to uniformly sample from multiple input files

Question

I have one serialized file for each class in my dataset. I would like to use queues to load up each of these files and then place them in a RandomShuffleQueue that will pull them off so I get a random mix of examples from each class. I thought this code would work.

In this example each file has 10 examples.

filenames = ["a", "b", ...]

with self.test_session() as sess:
  # for each file open a queue and get that
  # queue's results. 
  strings = []
  rq = tf.RandomShuffleQueue(1000, 10, [tf.string], shapes=())
  for filename in filenames:
    q = tf.FIFOQueue(99, [tf.string], shapes=())
    q.enqueue([filename]).run()
    q.close().run()
    # read_string just pulls a string from the file
    key, out_string = input_data.read_string(q, IMAGE_SIZE, CHANNELS, LABEL_BYTES)
    strings.append(out_string)

    rq.enqueue([out_string]).run()

  rq.close().run()
  qs = rq.dequeue()
  label, image = input_data.string_to_data(qs, IMAGE_SIZE, CHANNELS, LABEL_BYTES)
  for i in range(11):
    l, im = sess.run([label, image])
    print("L: {}".format(l))

This works fine for 10 calls, but on the 11th it says that the queue is empty.

I believe this is due to a misunderstanding on my part of what these queues operate on. I add 10 variables to the RandomShuffleQueue, but each of those variables is itself pulling from a queue, so I assumed the queue would not be emptied until each of the file queues was empty.

What am I doing wrong here?

Recommended answer

The correct answer to this question will depend on how many files you have, how large they are, and how their sizes are distributed.

The immediate problem with your example is that rq only gets one element for each filename in filenames, then the queue is closed. I'm presuming that there are 10 filenames, since rq.dequeue() will consume one element of rq each time you call sess.run([label, image]). Since the queue is closed, no more elements can be added, and the 11th activation of the rq.dequeue() operation fails.

The general solution is that you have to create additional threads to keep running rq.enqueue([out_string]) in a loop. TensorFlow includes a QueueRunner class that is designed to simplify this, and some other functions that handle common cases. The documentation for threading and queues explains how they are used, and there is also some good information on using queues to read from files.
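The idea of background threads that keep enqueueing can be sketched with the Python standard library alone. This is only an analogue under stated assumptions, not TensorFlow's QueueRunner API: `reader`, `enqueue_loop`, and `run` are hypothetical names, and each "file" is simulated by a generator that yields 10 records.

```python
import queue
import threading


def reader(filename, n_examples):
    """Hypothetical stand-in for a file reader: yields n_examples records."""
    for i in range(n_examples):
        yield "{}:{}".format(filename, i)


def enqueue_loop(source, out_queue):
    """Keep enqueueing until the source is exhausted.

    This is the role a queue-runner thread plays: the enqueue step runs
    repeatedly in the background instead of once in the main thread.
    """
    for record in source:
        out_queue.put(record)  # blocks while the queue is full


def run(filenames, n_examples=10):
    # Deliberately smaller than the total number of records, so the
    # producers have to wait for the consumer to make room.
    out_queue = queue.Queue(maxsize=8)
    threads = [
        threading.Thread(target=enqueue_loop,
                         args=(reader(name, n_examples), out_queue))
        for name in filenames
    ]
    for t in threads:
        t.start()

    # The consumer can now dequeue more than one record per file,
    # because the producers keep refilling the queue.
    total = len(filenames) * n_examples
    records = [out_queue.get() for _ in range(total)]

    for t in threads:
        t.join()
    return records


records = run(["a", "b", "c"])
print(len(records))  # 30
```

In the question's code, by contrast, each enqueue runs exactly once in the main thread, so the shared queue never holds more than one element per file.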

As to your particular problem, one way to handle this would be to create N readers (one for each of the N files). You could then tf.pack() N elements (one from each reader) into a batch (tf.pack() was renamed tf.stack() in TensorFlow 1.0), and use enqueue_many to add a batch at a time into a tf.RandomShuffleQueue with a sufficiently large capacity and min_after_dequeue to ensure that there is sufficient mixing between the classes. Calling dequeue_many(k) on the RandomShuffleQueue would then give you a batch of k elements sampled from each file with equal probability.
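The sampling scheme described above can be illustrated without TensorFlow. The sketch below is a toy, single-threaded analogue, not the tf.RandomShuffleQueue implementation: `ShuffleBuffer` and `sample_batches` are hypothetical names, the files are simulated as lists of `(filename, index)` records, and all files are assumed to hold the same number of examples.

```python
import random


class ShuffleBuffer:
    """Toy analogue of a shuffle queue with capacity and min_after_dequeue."""

    def __init__(self, capacity, min_after_dequeue, seed=0):
        self.capacity = capacity
        self.min_after_dequeue = min_after_dequeue
        self.items = []
        self.rng = random.Random(seed)

    def enqueue_many(self, batch):
        if len(self.items) + len(batch) > self.capacity:
            raise OverflowError("buffer is full")
        self.items.extend(batch)

    def can_dequeue(self, k, closed):
        # While open, keep at least min_after_dequeue items behind so the
        # remaining pool stays well mixed; once closed, drain everything.
        floor = 0 if closed else self.min_after_dequeue
        return len(self.items) - k >= floor

    def dequeue_many(self, k):
        out = []
        for _ in range(k):
            # Pick a uniformly random item, swap it to the end, pop it.
            i = self.rng.randrange(len(self.items))
            self.items[i], self.items[-1] = self.items[-1], self.items[i]
            out.append(self.items.pop())
        return out


def sample_batches(filenames, examples_per_file, k,
                   capacity=64, min_after_dequeue=16):
    # One simulated reader per file.
    readers = [iter([(name, i) for i in range(examples_per_file)])
               for name in filenames]
    # Each round holds one element from every reader (the "pack" step).
    rounds = zip(*readers)
    buf = ShuffleBuffer(capacity, min_after_dequeue)
    batches, exhausted = [], False
    while True:
        # Fill: add whole rounds while they still fit in the buffer.
        while not exhausted and len(buf.items) + len(readers) <= capacity:
            batch = next(rounds, None)
            if batch is None:
                exhausted = True
            else:
                buf.enqueue_many(batch)
        if not buf.can_dequeue(k, closed=exhausted):
            break
        batches.append(buf.dequeue_many(k))
    return batches


batches = sample_batches(["a", "b", "c"], examples_per_file=100, k=5)
print(len(batches), batches[0])
```

Because every round adds exactly one element per file, the buffer always holds roughly equal numbers from each class, so a uniformly random draw favours no class. A larger `min_after_dequeue` increases mixing at the cost of memory and start-up delay.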
