Compressed files bigger in h5py


Question

I'm using h5py to save NumPy arrays in HDF5 format from Python. Recently, I tried to apply compression, and the files I get are bigger...

I went from this (every file has several datasets):

self._h5_current_frame.create_dataset(
        'estimated position', shape=estimated_pos.shape,
        dtype=float, data=estimated_pos)

to this:

self._h5_current_frame.create_dataset(
        'estimated position', shape=estimated_pos.shape, dtype=float,
        data=estimated_pos, compression="gzip", compression_opts=9)

In one particular example, the compressed file is 172K and the uncompressed one is 72K (and h5diff reports the two files are equal). I tried a more basic example and it works as expected... but not in my program.

How is that possible? I don't think the gzip algorithm ever produces a bigger compressed file, so it's probably related to h5py and how I'm using it :-/ Any ideas?

Cheers!

Judging from the output of h5stat, the compressed version seems to store a lot of extra metadata (see the last few lines of each listing).

The compressed file:

Filename: res_totolaca_jue_2015-10-08_17:06:30_19387.hdf5
File information
    # of unique groups: 21
    # of unique datasets: 56
    # of unique named datatypes: 0
    # of unique links: 0
    # of unique other: 0
    Max. # of links to object: 1
    Max. # of objects in group: 5
File space information for file metadata (in bytes):
    Superblock extension: 0
    User block: 0
    Object headers: (total/unused)
        Groups: 3798/503
        Datasets(exclude compact data): 15904/9254
        Datatypes: 0/0
    Groups:
        B-tree/List: 0
        Heap: 0
    Attributes:
        B-tree/List: 0
        Heap: 0
    Chunked datasets:
        Index: 116824
    Datasets:
        Heap: 0
    Shared Messages:
        Header: 0
        B-tree/List: 0
        Heap: 0
Small groups (with 0 to 9 links):
    # of groups with 1 link(s): 1
    # of groups with 2 link(s): 5
    # of groups with 3 link(s): 5
    # of groups with 5 link(s): 10
    Total # of small groups: 21
Group bins:
    # of groups with 1 - 9 links: 21
    Total # of groups: 21
Dataset dimension information:
    Max. rank of datasets: 3
    Dataset ranks:
        # of dataset with rank 1: 51
        # of dataset with rank 2: 3
        # of dataset with rank 3: 2
1-D Dataset information:
    Max. dimension size of 1-D datasets: 624
    Small 1-D datasets (with dimension sizes 0 to 9):
        # of datasets with dimension sizes 1: 36
        # of datasets with dimension sizes 2: 2
        # of datasets with dimension sizes 3: 2
        Total # of small datasets: 40
    1-D Dataset dimension bins:
        # of datasets with dimension size 1 - 9: 40
        # of datasets with dimension size 10 - 99: 2
        # of datasets with dimension size 100 - 999: 9
        Total # of datasets: 51
Dataset storage information:
    Total raw data size: 33602
    Total external raw data size: 0
Dataset layout information:
    Dataset layout counts[COMPACT]: 0
    Dataset layout counts[CONTIG]: 2
    Dataset layout counts[CHUNKED]: 54
    Number of external files : 0
Dataset filters information:
    Number of datasets with:
        NO filter: 2
        GZIP filter: 54
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 0
Dataset datatype information:
    # of unique datatypes used by datasets: 4
    Dataset datatype #0:
        Count (total/named) = (20/0)
        Size (desc./elmt) = (14/8)
    Dataset datatype #1:
        Count (total/named) = (17/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #2:
        Count (total/named) = (10/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #3:
        Count (total/named) = (9/0)
        Size (desc./elmt) = (14/8)
    Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
    Total # of objects with small # of attributes: 0
Attribute bins:
    Total # of objects with attributes: 0
    Max. # of attributes to objects: 0
Summary of file space information:
  File metadata: 136526 bytes
  Raw data: 33602 bytes
  Unaccounted space: 5111 bytes
Total space: 175239 bytes

The uncompressed file:

Filename: res_totolaca_jue_2015-10-08_17:03:04_19267.hdf5
File information
    # of unique groups: 21
    # of unique datasets: 56
    # of unique named datatypes: 0
    # of unique links: 0
    # of unique other: 0
    Max. # of links to object: 1
    Max. # of objects in group: 5
File space information for file metadata (in bytes):
    Superblock extension: 0
    User block: 0
    Object headers: (total/unused)
        Groups: 3663/452
        Datasets(exclude compact data): 15904/10200
        Datatypes: 0/0
    Groups:
        B-tree/List: 0
        Heap: 0
    Attributes:
        B-tree/List: 0
        Heap: 0
    Chunked datasets:
        Index: 0
    Datasets:
        Heap: 0
    Shared Messages:
        Header: 0
        B-tree/List: 0
        Heap: 0
Small groups (with 0 to 9 links):
    # of groups with 1 link(s): 1
    # of groups with 2 link(s): 5
    # of groups with 3 link(s): 5
    # of groups with 5 link(s): 10
    Total # of small groups: 21
Group bins:
    # of groups with 1 - 9 links: 21
    Total # of groups: 21
Dataset dimension information:
    Max. rank of datasets: 3
    Dataset ranks:
        # of dataset with rank 1: 51
        # of dataset with rank 2: 3
        # of dataset with rank 3: 2
1-D Dataset information:
    Max. dimension size of 1-D datasets: 624
    Small 1-D datasets (with dimension sizes 0 to 9):
        # of datasets with dimension sizes 1: 36
        # of datasets with dimension sizes 2: 2
        # of datasets with dimension sizes 3: 2
        Total # of small datasets: 40
    1-D Dataset dimension bins:
        # of datasets with dimension size 1 - 9: 40
        # of datasets with dimension size 10 - 99: 2
        # of datasets with dimension size 100 - 999: 9
        Total # of datasets: 51
Dataset storage information:
    Total raw data size: 50600
    Total external raw data size: 0
Dataset layout information:
    Dataset layout counts[COMPACT]: 0
    Dataset layout counts[CONTIG]: 56
    Dataset layout counts[CHUNKED]: 0
    Number of external files : 0
Dataset filters information:
    Number of datasets with:
        NO filter: 56
        GZIP filter: 0
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 0
Dataset datatype information:
    # of unique datatypes used by datasets: 4
    Dataset datatype #0:
        Count (total/named) = (20/0)
        Size (desc./elmt) = (14/8)
    Dataset datatype #1:
        Count (total/named) = (17/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #2:
        Count (total/named) = (10/0)
        Size (desc./elmt) = (22/8)
    Dataset datatype #3:
        Count (total/named) = (9/0)
        Size (desc./elmt) = (14/8)
    Total dataset datatype count: 56
Small # of attributes (objects with 1 to 10 attributes):
    Total # of objects with small # of attributes: 0
Attribute bins:
    Total # of objects with attributes: 0
    Max. # of attributes to objects: 0
Summary of file space information:
  File metadata: 19567 bytes
  Raw data: 50600 bytes
  Unaccounted space: 5057 bytes
Total space: 75224 bytes

Answer

First, here's a reproducible example:

import h5py
from scipy.misc import lena
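# note: scipy.misc.lena() has been removed from recent SciPy releases; any
# compressible 2-D array (e.g. scipy.datasets.ascent()) can be used instead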

img = lena()    # some compressible image data

f1 = h5py.File('nocomp.h5', 'w')
f1.create_dataset('img', data=img)
f1.close()

f2 = h5py.File('complevel_9.h5', 'w')
f2.create_dataset('img', data=img, compression='gzip', compression_opts=9)
f2.close()

f3 = h5py.File('complevel_0.h5', 'w')
f3.create_dataset('img', data=img, compression='gzip', compression_opts=0)
f3.close()

Now let's look at the file sizes:

~$ h5stat -S nocomp.h5
Filename: nocomp.h5
Summary of file space information:
  File metadata: 1304 bytes
  Raw data: 2097152 bytes
  Unaccounted space: 840 bytes
Total space: 2099296 bytes

~$ h5stat -S complevel_9.h5
Filename: complevel_9.h5
Summary of file space information:
  File metadata: 11768 bytes
  Raw data: 302850 bytes
  Unaccounted space: 1816 bytes
Total space: 316434 bytes

~$ h5stat -S complevel_0.h5
Filename: complevel_0.h5
Summary of file space information:
  File metadata: 11768 bytes
  Raw data: 2098560 bytes
  Unaccounted space: 1816 bytes
Total space: 2112144 bytes

In my example, compression with gzip -9 makes sense: although it requires an extra ~10kB of metadata, this is more than outweighed by a ~1794kB decrease in the size of the image data (a compression ratio of about 7:1). The net result is a ~6.6-fold reduction in total file size.

However, in your example the compression only reduces the size of your raw data by ~16kB (a compression ratio of about 1.5:1), which is massively outweighed by a 116kB increase in the size of the metadata. The reason the increase in metadata size is so much larger than in my example is probably that your file contains 56 datasets rather than just one.

Even if gzip magically reduced the size of your raw data to zero, you would still end up with a file ~1.8 times larger than the uncompressed version. The size of the metadata is more or less guaranteed to scale sublinearly with the size of your arrays, so if your datasets were much larger you would start to see some benefit from compressing them. As it stands, your arrays are so small that you're unlikely to gain anything from compression.
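
To make the ~1.8x figure concrete, here is the arithmetic using the numbers from the two h5stat summaries above (a rough back-of-the-envelope check in Python):

# figures taken from the "Summary of file space information" sections above
compressed_overhead = 136526 + 5111   # file metadata + unaccounted space, bytes
uncompressed_total = 75224            # total size of the uncompressed file, bytes
print(compressed_overhead / uncompressed_total)   # ~1.88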

The reason the compressed version needs so much more metadata is not really the compression per se, but rather the fact that in order to use compression filters the dataset must be split into fixed-size chunks. Presumably a lot of the extra metadata is being used to store the B-tree needed to index the chunks. To separate the cost of chunking from the cost of compression, compare a chunked-but-uncompressed file, a single-chunk uncompressed file, and a single-chunk compressed file:

f4 = h5py.File('nocomp_autochunked.h5', 'w')
# let h5py pick a chunk size automatically
f4.create_dataset('img', data=img, chunks=True)
print(f4['img'].chunks)
# (32, 64)
f4.close()

f5 = h5py.File('nocomp_onechunk.h5', 'w')
# make the chunk shape the same as the shape of the array, so that there 
# is only one chunk
f5.create_dataset('img', data=img, chunks=img.shape)
print(f5['img'].chunks)
# (512, 512)
f5.close()

f6 = h5py.File('complevel_9_onechunk.h5', 'w')
f6.create_dataset('img', data=img, chunks=img.shape, compression='gzip',
                  compression_opts=9)
f6.close()

And the resulting file sizes:

~$ h5stat -S nocomp_autochunked.h5
Filename: nocomp_autochunked.h5
Summary of file space information:
  File metadata: 11768 bytes
  Raw data: 2097152 bytes
  Unaccounted space: 1816 bytes
Total space: 2110736 bytes

~$ h5stat -S nocomp_onechunk.h5
Filename: nocomp_onechunk.h5
Summary of file space information:
  File metadata: 3920 bytes
  Raw data: 2097152 bytes
  Unaccounted space: 96 bytes
Total space: 2101168 bytes

~$ h5stat -S complevel_9_onechunk.h5
Filename: complevel_9_onechunk.h5
Summary of file space information:
  File metadata: 3920 bytes
  Raw data: 305051 bytes
  Unaccounted space: 96 bytes
Total space: 309067 bytes

It's clear that the extra metadata comes from chunking rather than from compression: nocomp_autochunked.h5 contains exactly the same amount of metadata as complevel_0.h5 above, and adding compression to the single-chunk version in complevel_9_onechunk.h5 made no difference to the total amount of metadata.

Increasing the chunk size so that the array is stored as a single chunk reduced the amount of metadata by a factor of about 3 in this example. How much difference this would make in your case will depend on how h5py automatically selects chunk sizes for your input datasets. Interestingly, it also resulted in a very slight reduction in the compression ratio, which is not what I would have predicted.
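
If you want to check which chunk shapes and filters actually ended up in your file, here is a minimal sketch (the filename is the one from your h5stat output; substitute your own path):

import h5py

def report_layout(name, obj):
    # print chunk shape and compression filter for every dataset in the file
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype,
              'chunks =', obj.chunks, 'compression =', obj.compression)

with h5py.File('res_totolaca_jue_2015-10-08_17:06:30_19387.hdf5', 'r') as f:
    f.visititems(report_layout)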

Bear in mind that there are also disadvantages to larger chunks. Whenever you want to access a single element within a chunk, the whole chunk needs to be decompressed and read into memory. For a large dataset this can be disastrous for performance, but in your case the arrays are so small that it's probably not worth worrying about.
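
If you want to see the effect for yourself, here is a small timing sketch under assumed names (chunk_demo.h5 and the synthetic array are made up for illustration) that compares reading one element from a dataset stored as a single big chunk versus many small chunks:

import time
import numpy as np
import h5py

data = np.arange(512 * 512, dtype=float).reshape(512, 512)   # mildly compressible

with h5py.File('chunk_demo.h5', 'w') as f:
    f.create_dataset('one_chunk', data=data, chunks=data.shape,
                     compression='gzip', compression_opts=9)
    f.create_dataset('small_chunks', data=data, chunks=(32, 32),
                     compression='gzip', compression_opts=9)

with h5py.File('chunk_demo.h5', 'r') as f:
    for name in ('one_chunk', 'small_chunks'):
        t0 = time.perf_counter()
        _ = f[name][200, 300]   # reading one element decompresses the whole chunk it lives in
        print(name, time.perf_counter() - t0, 'seconds')

Reading from the single-chunk dataset has to decompress the full 2MB chunk, whereas the (32, 32) layout only decompresses an 8kB chunk.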

Another thing you should consider is whether you can store your datasets in a single array rather than in lots of small ones. For example, if you have K 2-D arrays of the same dtype, each with dimensions MxN, then you could store them more efficiently in a KxMxN 3-D array than as lots of small datasets. I don't know enough about your data to know whether this is feasible.
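
As a hedged sketch of what that could look like, assuming you have a list of equally shaped 2-D arrays (the names frames and combined.h5 are made up for illustration):

import numpy as np
import h5py

# pretend these are your K small 2-D arrays, all with the same shape and dtype
frames = [np.random.rand(8, 8) for _ in range(56)]

with h5py.File('combined.h5', 'w') as f:
    # one KxMxN dataset means one object header and one chunk index,
    # instead of 56 of each
    f.create_dataset('frames', data=np.stack(frames),
                     compression='gzip', compression_opts=9)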
