Getting good mixing with many input datafiles in tensorflow


Question


I'm working with tensorflow, hoping to train a deep CNN to do move prediction for the game Go. The dataset I created consists of 100,000 binary data files, where each datafile corresponds to a recorded game and contains roughly 200 training samples (one for each move in the game). I believe it will be very important to get good mixing when using SGD. I'd like my batches to contain samples from different games AND samples from different stages of the games. So, for example, simply reading one sample from the start of 100 files and shuffling isn't good, because those 100 samples will all be the first move of each game.


I have read the tutorial on feeding data from files but I'm not sure if their provided libraries do what I need. If I were to hard code it myself I would basically initialize a bunch of file pointers to random locations within each file and then pull samples from random files, incrementing the file pointers accordingly.
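The hand-rolled approach described above can be sketched in a few lines of plain Python. Record size, file paths, and the `make_batch` helper are all hypothetical names for illustration, assuming fixed-length binary records:

```python
# A minimal pure-Python sketch of the hand-rolled idea: keep a read
# offset into each file and pull samples from randomly chosen files,
# advancing the touched file's offset (wrapping at end of file).
import os
import random

def make_batch(paths, record_bytes, batch_size, offsets, rng):
    """Draw one batch of raw records, mutating `offsets` in place."""
    batch = []
    for _ in range(batch_size):
        i = rng.randrange(len(paths))          # pick a random file
        size = os.path.getsize(paths[i])
        with open(paths[i], "rb") as f:
            f.seek(offsets[i])                 # resume where we left off
            batch.append(f.read(record_bytes))
        offsets[i] = (offsets[i] + record_bytes) % size  # wrap around
    return batch
```

The initial offsets would be chosen at random record boundaries, e.g. `offsets = [rng.randrange(os.path.getsize(p) // record_bytes) * record_bytes for p in paths]`, so the first pass doesn't start every file at move one.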


So, my question is: does tensorflow provide this sort of functionality, or would it be easier to write my own code for creating batches?

Answer


Yes - what you want is to use a combination of two things.


First, randomly shuffle the order in which you input your datafiles, by reading from them using a tf.train.string_input_producer with shuffle=True that feeds into whatever input method you use (if you can put your examples into tf.Example proto format, that's easy to use with parse_example). To be very clear, you put the list of filenames in the string_input_producer and then read them with another method such as read_file, etc.
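A minimal sketch of this first stage. These queue-based APIs are from TensorFlow 1.x (removed in the 2.x namespace), so the snippet goes through `tf.compat.v1`; the file pattern is a hypothetical placeholder:

```python
# Stage 1: shuffle the order in which files are visited.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # queue-based input requires graph mode

filenames = ["games/g1.bin", "games/g2.bin"]  # hypothetical paths

# shuffle=True reshuffles the filename order on every epoch, so the
# files are visited in a different random order on each pass.
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)

# A reader dequeues one filename at a time and yields (key, contents).
reader = tf.WholeFileReader()
key, value = reader.read(filename_queue)
```

`WholeFileReader` is shown for simplicity; for files containing many fixed-size samples, a `FixedLengthRecordReader` (or `TFRecordReader` with `tf.Example` protos) emits one record at a time instead of the whole file.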


Second, you need to mix at a finer granularity. You can accomplish this by feeding the input examples into a tf.train.shuffle_batch node with a large capacity and large value of min_after_dequeue. One particularly nice way is to use a shuffle_batch_join that receives input from multiple files, so that you get a lot of mixing. Set the capacity of the batch big enough to mix well without exhausting your RAM. Tens of thousands of examples usually works pretty well.
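A sketch of this second stage, again via `tf.compat.v1`. The record size, reader count, decoding step, and buffer sizes are illustrative assumptions, not values from the answer:

```python
# Stage 2: several readers pull records from the shuffled filename
# queue in parallel, and shuffle_batch_join mixes their outputs
# through one large shuffling buffer.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

RECORD_BYTES = 1024  # hypothetical size of one serialized sample
filenames = ["games/g1.bin", "games/g2.bin"]  # hypothetical paths
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)

num_readers = 4  # more readers -> samples from more files in flight
example_lists = []
for _ in range(num_readers):
    reader = tf.FixedLengthRecordReader(record_bytes=RECORD_BYTES)
    _, record = reader.read(filename_queue)
    example = tf.decode_raw(record, tf.uint8)  # stand-in for real parsing
    example.set_shape([RECORD_BYTES])
    example_lists.append([example])

min_after_dequeue = 20000  # "tens of thousands", per the answer
batch = tf.train.shuffle_batch_join(
    example_lists,
    batch_size=128,
    capacity=min_after_dequeue + 3 * 128,
    min_after_dequeue=min_after_dequeue)
```

`min_after_dequeue` is what controls mixing quality: a dequeued batch is sampled from at least that many buffered examples, so the larger it is (within RAM limits), the better the shuffle.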


Keep in mind that the batch functions add a QueueRunner to the QUEUE_RUNNERS collection, so you need to run tf.train.start_queue_runners() in your session before pulling any data, or the pipeline will block forever.
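Putting the pieces together, here is a hypothetical end-to-end sketch (via `tf.compat.v1`): it writes a few fake fixed-length "game" files, builds the pipeline with a single reader for brevity, starts the queue runners, and pulls one shuffled batch. All sizes and names are illustrative:

```python
import os
import tempfile

import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Fake data: four files, 50 fixed-length records each.
RECORD_BYTES = 16
tmpdir = tempfile.mkdtemp()
for i in range(4):
    data = np.full((50, RECORD_BYTES), i, dtype=np.uint8)
    with open(os.path.join(tmpdir, "game%d.bin" % i), "wb") as f:
        f.write(data.tobytes())

# Pipeline: shuffled filenames -> record reader -> shuffling batcher.
filenames = tf.gfile.Glob(os.path.join(tmpdir, "*.bin"))
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)
reader = tf.FixedLengthRecordReader(record_bytes=RECORD_BYTES)
_, record = reader.read(filename_queue)
example = tf.decode_raw(record, tf.uint8)
example.set_shape([RECORD_BYTES])
batch = tf.train.shuffle_batch(
    [example], batch_size=32, capacity=500, min_after_dequeue=100)

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    coord = tf.train.Coordinator()
    # Without this call, the queues never fill and sess.run(batch) hangs.
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    first_batch = sess.run(batch)  # one (32, RECORD_BYTES) array
    coord.request_stop()
    coord.join(threads)
```

In a real training loop the `sess.run(batch)` would be replaced by running the train op, and the Coordinator's stop/join would sit in a `finally:` block so reader threads shut down cleanly on error.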

