Getting good mixing with many input datafiles in tensorflow

Question

I'm working with TensorFlow, hoping to train a deep CNN to do move prediction for the game Go. The dataset I created consists of 100,000 binary datafiles, where each datafile corresponds to a recorded game and contains roughly 200 training samples (one for each move in the game). I believe it will be very important to get good mixing when using SGD. I'd like my batches to contain samples from different games AND samples from different stages of the games. So, for example, simply reading one sample from the start of 100 files and shuffling isn't good, because those 100 samples will all be the first move of each game.

I have read the tutorial on feeding data from files but I'm not sure if their provided libraries do what I need. If I were to hard code it myself I would basically initialize a bunch of file pointers to random locations within each file and then pull samples from random files, incrementing the file pointers accordingly.
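Below is a minimal sketch of that hand-rolled approach, assuming each datafile is a flat sequence of fixed-size binary records; RECORD_BYTES, the games/ directory, and the number of files kept open at once are hypothetical choices for illustration, not anything stated in the question.

    import os
    import random

    RECORD_BYTES = 4096  # assumed size of one serialized training sample
    all_paths = [os.path.join("games", name) for name in os.listdir("games")]

    # Opening all 100,000 files at once would hit OS file-handle limits,
    # so keep a subset open and seek each one to a random record boundary.
    handles = []
    for path in random.sample(all_paths, 100):
        f = open(path, "rb")
        n_records = os.path.getsize(path) // RECORD_BYTES
        f.seek(random.randrange(n_records) * RECORD_BYTES)
        handles.append(f)

    def next_sample():
        """Pull one record from a randomly chosen open file, wrapping at EOF."""
        f = random.choice(handles)
        data = f.read(RECORD_BYTES)
        if len(data) < RECORD_BYTES:  # hit end of file, start over
            f.seek(0)
            data = f.read(RECORD_BYTES)
        return data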

So, my question is: does TensorFlow provide this sort of functionality, or would it be easier to write my own code for creating batches?

Answer

Yes - what you want is a combination of two things. (Note that this answer was written for TensorFlow v1, and some of the functionality has been replaced by the new tf.data pipelines; I've updated the answer to point to the v1 compat versions of things, but if you're coming to this answer for new code, please consult tf.data instead.)

First, randomly shuffle the order in which you input your datafiles, by reading from them using a tf.train.string_input_producer with shuffle=True that feeds into whatever input method you use (if you can put your examples into tf.Example proto format, that's easy to use with parse_example). To be very clear, you put the list of filenames in the string_input_producer and then read them with another method such as read_file, etc.
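As a minimal sketch of this first step (assuming the games have been converted to TFRecord files of tf.Example protos; the file pattern and the "board"/"move" feature names and shapes are made up for illustration):

    import tensorflow.compat.v1 as tf
    tf.disable_eager_execution()

    filenames = tf.gfile.Glob("games/*.tfrecord")  # hypothetical file layout

    # Hands out the filenames in a freshly shuffled order on each pass.
    filename_queue = tf.train.string_input_producer(filenames, shuffle=True)

    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)

    features = tf.parse_single_example(
        serialized,
        features={
            "board": tf.FixedLenFeature([19 * 19], tf.float32),  # assumed encoding
            "move": tf.FixedLenFeature([], tf.int64),
        })
    board, move = features["board"], features["move"]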

Second, you need to mix at a finer granularity. You can accomplish this by feeding the input examples into a tf.train.shuffle_batch node with a large capacity and a large min_after_dequeue value. One particularly nice way is to use shuffle_batch_join, which receives input from multiple files, so that you get a lot of mixing. Set the batch capacity large enough to mix well without exhausting your RAM; tens of thousands of examples usually works pretty well.
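Continuing the sketch above, one way to wire this up (the number of readers and the capacity / min_after_dequeue values are illustrative, not tuned recommendations):

    def read_one_example(filename_queue):
        # Each call creates its own reader, so each reads from a different file.
        reader = tf.TFRecordReader()
        _, serialized = reader.read(filename_queue)
        features = tf.parse_single_example(
            serialized,
            features={
                "board": tf.FixedLenFeature([19 * 19], tf.float32),
                "move": tf.FixedLenFeature([], tf.int64),
            })
        return [features["board"], features["move"]]

    # Eight independent readers feed the same shuffling queue, so a batch
    # mixes positions drawn from several different games at once.
    example_list = [read_one_example(filename_queue) for _ in range(8)]

    board_batch, move_batch = tf.train.shuffle_batch_join(
        example_list,
        batch_size=128,
        capacity=50000,           # large buffer so examples mix well
        min_after_dequeue=20000)  # keep tens of thousands of examples queued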

Keep in mind that the batch functions add a QueueRunner to the QUEUE_RUNNERS collection, so you need to run tf.train.start_queue_runners().
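For completeness, a sketch of driving the pipeline (the training step itself is just a placeholder here):

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())

        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        try:
            while not coord.should_stop():
                boards, moves = sess.run([board_batch, move_batch])
                # ... run your training op on this batch ...
        except tf.errors.OutOfRangeError:
            pass  # raised if the input producer is given a finite num_epochs
        finally:
            coord.request_stop()
            coord.join(threads)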
