What is the best between multiple small h5 files or one huge one?

Question

I'm working with huge satellite data that I'm splitting into small tiles to feed a deep learning model. I'm using PyTorch, which means the data loader can work with multiple threads. [setup: Python, Ubuntu 18.04]

I can't find any answer as to which is best in terms of data access and storage between:

  1. storing all the data in one huge HDF5 file (over 20 GB)
  2. splitting it into multiple (over 16,000) small HDF5 files (approx. 1.4 MB each).

Is there any problem with multiple threads accessing one file? And in the other case, is there an impact of having that many files?

Solution

I would go for multiple files if I were you (but read to the end).

Intuitively, you could load at least some of the files into memory, speeding the process up a little (it is unlikely you would be able to do so with 20 GB; if you can, then you definitely should, as RAM access is much faster).

You could cache those examples (inside a custom torch.utils.data.Dataset instance) during the first pass and retrieve the cached examples afterwards (say, in a list or another, more memory-efficient data structure with better cache locality) instead of reading from disk (a similar approach to TensorFlow's tf.data.Dataset object and its cache method).
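As a minimal sketch of that caching idea (the class name, the one-file-per-tile layout, the use of h5py, and the "tile" dataset name are my assumptions, not part of the answer):

    import h5py
    import torch
    from torch.utils.data import Dataset

    class CachedTileDataset(Dataset):
        """Reads each tile from disk once, then serves it from an in-memory cache."""

        def __init__(self, tile_paths):
            self.tile_paths = tile_paths            # one small .h5 file per tile (assumed layout)
            self.cache = [None] * len(tile_paths)   # filled lazily during the first pass

        def __len__(self):
            return len(self.tile_paths)

        def __getitem__(self, idx):
            if self.cache[idx] is None:
                # First access: read from disk and keep the tensor in memory.
                with h5py.File(self.tile_paths[idx], "r") as f:
                    self.cache[idx] = torch.from_numpy(f["tile"][()])
            return self.cache[idx]

Note that with DataLoader num_workers > 0 each worker process holds its own copy of this cache, so the benefit is largest with num_workers=0 or when the cached tiles fit comfortably in memory per worker.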

On the other hand, this approach is more cumbersome and harder to implement correctly, though if you are only reading the file with multiple threads you should be fine, and there shouldn't be any locks on this operation.
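If you go with the single large file instead, a common pattern (again my assumption: h5py, a dataset named "tiles", and a known tile count, none of which are given in the question) is to open the HDF5 file lazily inside __getitem__, so each data-loader worker ends up with its own file handle rather than sharing one:

    import h5py
    import torch
    from torch.utils.data import Dataset

    class BigFileTileDataset(Dataset):
        """Reads tiles from one large HDF5 file, opening it lazily per worker."""

        def __init__(self, h5_path, num_tiles):
            self.h5_path = h5_path
            self.num_tiles = num_tiles
            self.file = None  # opened on first access in each worker process

        def __len__(self):
            return self.num_tiles

        def __getitem__(self, idx):
            if self.file is None:
                self.file = h5py.File(self.h5_path, "r")
            return torch.from_numpy(self.file["tiles"][idx])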

Remember to measure your approach with PyTorch's profiler (torch.utils.bottleneck) to pinpoint exact problems and verify solutions.
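For reference, torch.utils.bottleneck is run as a wrapper around your training script from the command line (the script name below is a placeholder, not from the original post):

    python -m torch.utils.bottleneck train_tiles.py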
