TFRecords and record shuffling


Question

My understanding is that it is good practice to shuffle training samples for each epoch so that each mini-batch contains a good random sample of the entire dataset. If I convert my entire dataset into a single file containing TFRecords, how can this shuffling be achieved short of loading the entire dataset? My understanding is that there is no efficient random access to TFRecord files. So, to be specific, I am looking for guidance on how TFRecord files are used in this scenario.

Recommended answer

It's not. You can, however, improve the mixing somewhat by sharding your input into multiple input data files and then treating them as explained in this answer.
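
As a rough illustration of the sharding step, here is a minimal pure-Python sketch. The function name `shard_records`, the list-of-lists stand-in for files, and the shard count are assumptions for illustration only. In a real pipeline each shard would be written to its own TFRecord file (for example with `tf.io.TFRecordWriter`).

```python
import random


def shard_records(records, num_shards, seed=0):
    """Assign each record to one of num_shards shards uniformly at random.

    Each inner list stands in for one output file (hypothetical layout).
    """
    rng = random.Random(seed)
    shards = [[] for _ in range(num_shards)]
    for record in records:
        # Random assignment already mixes records that were adjacent
        # in the original, unsharded dataset.
        shards[rng.randrange(num_shards)].append(record)
    return shards


# Example: 1000 integer "records" spread over 10 shards.
shards = shard_records(range(1000), num_shards=10)
```

Assigning records to shards at random, rather than cutting the dataset into contiguous chunks, means any ordering in the original data (say, sorted by class) does not survive inside a shard.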

If you need anything close to "perfect" shuffling, you would need to read the data into memory. In practice, though, for most things you'll probably get "good enough" shuffling by just splitting into 100 or 1000 files and then using a shuffle queue that's big enough to hold 8-16 files' worth of data.
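
To make the shuffle-queue idea concrete, here is a minimal pure-Python sketch of a bounded shuffle buffer fed from shards that are read in a random order. It is not TensorFlow's queue implementation. `shuffled_stream`, the in-memory shards, and the sizes are hypothetical, chosen so the buffer holds about 12 shards' worth of records.

```python
import random


def shuffled_stream(shards, buffer_size, seed=None):
    """Yield every record once, approximately shuffled.

    shards: list of record lists (stand-ins for files).
    buffer_size: maximum number of records held in memory; must be >= 1.
    """
    rng = random.Random(seed)
    order = list(range(len(shards)))
    rng.shuffle(order)  # visit the shards in a fresh random order
    buffer = []
    for shard_index in order:
        for record in shards[shard_index]:
            if len(buffer) < buffer_size:
                buffer.append(record)  # still filling the buffer
                continue
            # Buffer is full: emit a random buffered record and put the
            # incoming record in its slot.
            slot = rng.randrange(buffer_size)
            yield buffer[slot]
            buffer[slot] = record
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


# Example: 100 shards of 10 records each, buffer of 120 records.
shards = [list(range(i * 10, (i + 1) * 10)) for i in range(100)]
epoch = list(shuffled_stream(shards, buffer_size=120, seed=1))
```

Raising `buffer_size` toward the dataset size approaches a full shuffle, at the cost of memory. With a smaller buffer, a record can only move a limited distance from where its shard was read, which is why randomizing the shard order each epoch matters.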

I have an itch in the back of my head to write an external random shuffle queue that can spill to disk, but it's very low on my priority list. If someone wanted to contribute one, I'm volunteering to review it. :)
