How is HDF5 different from a folder with files?


Problem Description

I'm working on an open source project dealing with adding metadata to folders. The provided (Python) API lets you browse and access metadata like it was just another folder. Because it is just another folder.

\folder\.meta\folder\somedata.json

Then I came across HDF5 and its derivation Alembic.

Reading up on HDF5 in the book Python and HDF5, I was looking for the benefits of using it compared to files in folders, but most of what I came across spoke about the benefits of a hierarchical file format in terms of how simple its API makes adding data:

>>> import h5py
>>> f = h5py.File("weather.hdf5", "a")
>>> f["/15/temperature"] = 21

Or its ability to read only certain parts of a file on request (e.g. random access), and to operate on a single HDF5 file in parallel (e.g. for multiprocessing).
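For instance, here is a minimal sketch of that kind of partial read with h5py (not from the book; the dataset path follows the snippet above and the array size is invented):

import h5py
import numpy as np

# Write a large dataset once (names and size are purely illustrative).
with h5py.File("weather.hdf5", "w") as f:
    f.create_dataset("/15/temperature", data=np.random.rand(1_000_000))

# Later, read back only ten values; h5py fetches just that slice from
# disk instead of loading the whole dataset into memory.
with h5py.File("weather.hdf5", "r") as f:
    window = f["/15/temperature"][1000:1010]
    print(window)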

You could mount HDF5 files, https://github.com/zjttoefs/hdfuse5

It even boasts a strong yet simple foundational concept of Groups and Datasets, which the wiki describes as:

  • Datasets, which are multidimensional arrays of a homogeneous type
  • Groups, which are container structures which can hold datasets and other groups

Replace Dataset with File and Group with Folder and the whole feature-set sounds to me like what files in folders are already fully capable of doing.

For every benefit I came across, not one stood out as being exclusive to HDF5.

So my question is, if I were to give you one HDF5 file and one folder with files, both with identical content, in which scenario would HDF5 be better suited?

Edit:

I've gotten some responses about the portability of HDF5.

It sounds lovely and all, but I still haven't been given an example, a scenario, in which an HDF5 file would outdo a folder with files. Why would someone consider using HDF5 when a folder is readable on any computer, on any file-system, over a network, supports "parallel I/O", and is readable by humans without an HDF5 interpreter?

I would go as far as to say, a folder with files is far more portable than any HDF5.

Edit 2:

Thucydides411 just gave an example of a scenario where portability matters. https://stackoverflow.com/a/28512028/478949

I think what I'm taking away from the answers in this thread is that HDF5 is well suited for when you need the organisational structure of files and folders, like in the example scenario above, with lots (millions) of small (~1 byte) data structures, like individual numbers or strings. It makes up for what file-systems lack by providing a "sub file-system" that favours the small and many as opposed to the few and large.
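As a hedged illustration of that "many small things in one file" point (the file name, dataset name, and sizes below are all invented):

import h5py
import numpy as np

# A million tiny values that would otherwise be a million tiny files
# become one chunked, compressed dataset inside a single file.
values = np.random.randint(0, 256, size=1_000_000, dtype=np.uint8)
with h5py.File("many_small_things.hdf5", "w") as f:
    f.create_dataset("values", data=values, chunks=(65536,), compression="gzip")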

In computer graphics, we use it to store geometric models and arbitrary data about individual vertices, which seems to align quite well with its use in the scientific community.

Solution

As someone who developed a scientific project that went from using folders of files to HDF5, I think I can shed some light on the advantages of HDF5.

When I began my project, I was operating on small test datasets, and producing small amounts of output, in the range of kilobytes. I began with the easiest data format, tables encoded as ASCII. For each object I processed, I produced one ASCII table.

I began applying my code to groups of objects, which meant writing multiple ASCII tables at the end of each run, along with an additional ASCII table containing output related to the entire group. For each group, I now had a folder that looked like:

+ group
|    |-- object 1
|    |-- object 2
|    |-- ...
|    |-- object N
|    |-- summary

At this point, I began running into my first difficulties. ASCII files are very slow to read and write, and they don't pack numeric information very efficiently, because each digit takes a full byte to encode, rather than ~3.3 bits. So I switched over to writing each object as a custom binary file, which sped up I/O and decreased file size.
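A rough way to see the size difference being described here (the numbers and file names are purely illustrative):

import os
import numpy as np

data = np.random.rand(100_000)   # made-up sample of float64 values
np.savetxt("table.txt", data)    # ASCII: roughly 25 text characters per value
data.tofile("table.bin")         # raw binary: exactly 8 bytes per float64
print(os.path.getsize("table.txt"), os.path.getsize("table.bin"))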

As I scaled up to processing large numbers (tens of thousands to millions) of groups, I suddenly found myself dealing with an extremely large number of files and folders. Having too many small files can be a problem for many filesystems (many filesystems are limited in the number of files they can store, regardless of how much disk space there is). I also began to find that when I would try to do post-processing on my entire dataset, the disk I/O to read many small files was starting to take up an appreciable amount of time. I tried to solve these problems by consolidating my files, so that I only produced two files for each group:

+ group 1
|    |-- objects
|    |-- summary
+ group 2
|    |-- objects
|    |-- summary
...

I also wanted to compress my data, so I began creating .tar.gz files for collections of groups.

At this point, my whole data scheme was getting very cumbersome, and there was a risk that if I wanted to hand my data to someone else, it would take a lot of effort to explain to them how to use it. The binary files that contained the objects, for example, had their own internal structure that existed only in a README file in a repository and on a pad of paper in my office. Whoever wanted to read one of my combined object binary files would have to know the byte offset, type and endianness of each metadata entry in the header, and the byte offset of every object in the file. If they didn't, the file would be gibberish to them.

The way I was grouping and compressing data also posed problems. Let's say I wanted to find one object. I would have to locate the .tar.gz file it was in, unzip the entire contents of the archive to a temporary folder, navigate to the group I was interested in, and retrieve the object with my own custom API to read my binary files. After I was done, I would delete the temporarily unzipped files. It was not an elegant solution.

At this point, I decided to switch to a standard format. HDF5 was attractive for a number of reasons. Firstly, I could keep the overall organization of my data into groups, object datasets and summary datasets. Secondly, I could ditch my custom binary file I/O API, and just use a multidimensional array dataset to store all the objects in a group. I could even create arrays of more complicated datatypes, like arrays of C structs, without having to meticulously document the byte offsets of every entry. Next, HDF5 has chunked compression which can be completely transparent to the end user of the data. Because the compression is chunked, if I think users are going to want to look at individual objects, I can have each object compressed in a separate chunk, so that only the part of the dataset the user is interested in needs to be decompressed. Chunked compression is an extremely powerful feature.
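A hedged sketch of what that might look like in h5py (the compound dtype, names, and chunk size are invented stand-ins for the per-group data described above):

import h5py
import numpy as np

# A compound dtype replaces the hand-documented byte offsets of the
# custom binary format.
obj_dtype = np.dtype([("id", "<u4"), ("mass", "<f8"), ("position", "<f8", (3,))])
objects = np.zeros(10_000, dtype=obj_dtype)
with h5py.File("results.hdf5", "w") as f:
    grp = f.create_group("group_001")
    # Chunking sets the unit of (transparent) gzip decompression, so a reader
    # who wants a few objects only decompresses the chunks that hold them.
    grp.create_dataset("objects", data=objects, chunks=(1_000,), compression="gzip")
    grp.create_dataset("summary", data=np.zeros(16))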

Finally, I can just give a single file to someone now, without having to explain much about how it's internally organized. The end user can read the file from Python, C, or Fortran, or inspect it with h5ls on the command line or the HDFView GUI, and see what's inside. That wasn't possible with my custom binary format, not to mention my .tar.gz collections.
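For example, someone with no knowledge of how the file was produced could walk its contents from Python (reusing the invented file from the sketch above):

import h5py

with h5py.File("results.hdf5", "r") as f:
    # Print every group and dataset in the file, including dataset shapes and dtypes.
    f.visititems(lambda name, obj: print(name, obj))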

Sure, it's possible to replicate everything you can do with HDF5 with folders, ASCII and custom binary files. That's what I originally did, but it became a major headache, and in the end, HDF5 did everything I was kluging together in an efficient and portable way.

