HDF5 Storage Overhead


Question

I'm writing a large number of small datasets to an HDF5 file, and the resulting file size is about 10x what I would expect from a naive tabulation of the data I'm putting in. My data is organized hierarchically as follows:

group 0
    -> subgroup 0
        -> dataset (dimensions: 100 x 4, datatype: float)
        -> dataset (dimensions: 100, datatype: float)
    -> subgroup 1
        -> dataset (dimensions: 100 x 4, datatype: float)
        -> dataset (dimensions: 100, datatype: float)
    ...
group 1
...

Each subgroup should take up 500 * 4 bytes = 2000 bytes, ignoring overhead. I don't store any attributes alongside the data. Yet, in testing, I find that each subgroup takes up about 4 kB, or about twice what I would expect. I understand that there is some overhead, but where is it coming from, and how can I reduce it? Is it in representing the group structure?
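
Below is a minimal sketch of this setup, assuming h5py and NumPy (the post does not say which HDF5 API was used; the file name, group/subgroup counts, and dataset names are illustrative). It recreates the hierarchy and compares the file size against the raw payload to expose the per-subgroup overhead:

    import os
    import h5py
    import numpy as np

    n_groups, n_subgroups = 10, 100          # illustrative counts
    path = "overhead_test.h5"                # hypothetical file name

    with h5py.File(path, "w") as f:
        for g in range(n_groups):
            group = f.create_group(f"group{g}")
            for s in range(n_subgroups):
                sub = group.create_group(f"subgroup{s}")
                # 100 x 4 and 100-element float32 datasets, as in the question
                sub.create_dataset("a", data=np.zeros((100, 4), dtype="float32"))
                sub.create_dataset("b", data=np.zeros(100, dtype="float32"))

    raw = n_groups * n_subgroups * 500 * 4   # 2000 bytes of payload per subgroup
    size = os.path.getsize(path)
    per_subgroup = (size - raw) / (n_groups * n_subgroups)
    print(f"payload: {raw} B, file: {size} B, overhead per subgroup: {per_subgroup:.0f} B")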

More information: If I increase the dimensions of the two datasets in each subgroup to 1000 x 4 and 1000, then each subgroup takes up about 22,250 bytes, rather than the 20,000 bytes I would expect. This implies an overhead of about 2.2 kB per subgroup, which is consistent with the results I was getting with the smaller dataset sizes. Is there any way to reduce this overhead?

Answer

I'll answer my own question. The overhead involved just in representing the group structure is enough that it doesn't make sense to store small arrays, or to have many groups, each containing only a small amount of data. There does not seem to be any way to reduce the overhead per group, which I measured at about 2.2 kB.

I resolved this issue by combining the two datasets in each subgroup into a (100 x 5) dataset. Then, I eliminated the subgroups, and combined all of the datasets in each group into a 3D dataset. Thus, if I had N subgroups previously, I now have one dataset in each group, with shape (N x 100 x 5). I thus save the N * 2.2 kB overhead that was previously present. Moreover, since HDF5's built-in compression is more effective with larger arrays, I now get a better than 1:1 overall packing ratio, whereas before, overhead took up half the space of the file, and compression was completely ineffective.
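
A sketch of the consolidated layout described above, under the same h5py assumption (the dataset name "data" and the counts are illustrative; gzip is HDF5's built-in deflate filter referred to in the answer):

    import h5py
    import numpy as np

    n_groups, n_subgroups = 10, 100
    with h5py.File("consolidated.h5", "w") as f:     # hypothetical file name
        for g in range(n_groups):
            group = f.create_group(f"group{g}")
            # Former 100 x 4 and 100-element datasets packed into 100 x 5 rows,
            # then all N subgroups stacked into one (N x 100 x 5) dataset.
            data = np.zeros((n_subgroups, 100, 5), dtype="float32")
            group.create_dataset("data", data=data, compression="gzip")

With one dataset per group, the per-subgroup metadata disappears, and the compression filter operates on one large array instead of many tiny ones.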

The lesson is to avoid complicated group structures in HDF5 files, and to try to combine as much data as possible into each dataset.
