Creating a gzip stream from separately compressed chunks


Question

I'd like to be able to generate a gzip (.gz) file using concurrent CPU threads. That is, I would be deflating separate chunks from the input file with separately initialized z_stream records.

The resulting file should be readable by zlib's inflate() function in a classic single-threaded operation.

Is that possible, even if it requires customized zlib code? The only requirement is that the currently existing zlib inflate code can handle it.

Update

The pigz source code demonstrates how it works. It uses some sophisticated optimizations to share the dictionary between chunks, keeping the compression rate optimal. It further handles bit packing if a more recent zlib version is used.

However, I'd like to understand how to roll my own, keeping things simple, without the optimizations pigz uses.

And while many consider source code to be the ultimate documentation (Ed Post, anyone?), I'd rather have it explained in plain words to avoid misunderstandings. (While the docs describe what happens pretty well, they do not explain too well what needs to be done to roll one's own.)

From browsing the code, I have figured out this much so far:

It appears that one simply creates each compressed chunk using deflate(..., Z_SYNC_FLUSH) instead of Z_FINISH. However, deflateEnd() gives an error then; I am not sure if that can be ignored. And one needs to calculate the final checksum over all chunks manually, though I wonder how to add the checksum at the end. There is also a rather complex put_trailer() function for writing the gzip trailer. I wonder if that could also be handled by zlib's own code for simple cases.

Any clarification on this would be appreciated.

Also, I realize that I should have asked about writing a zlib stream the same way, in order to write multithreaded-compressed files to a zip archive. There, I suspect, more simplifications are possible due to the lack of the more complex gzip header.

Answer

The answer is in your question. Each thread has its own deflate instance to produce raw deflate data (see deflateInit2()), which compresses the chunk of data fed to it, ending with Z_SYNC_FLUSH instead of Z_FINISH. Except for the last chunk of data, which you end with a Z_FINISH. Either way, this ends each resulting stream of compressed data on a byte boundary. Make sure that you get all of the generated data out of deflate(). Then you can concatenate all the compressed data streams (in the correct order!), preceded by a gzip header that you generate yourself. It is trivial to do that (see RFC 1952). It can just be a constant 10-byte sequence if you don't need any additional information included in the header (e.g. file name, modification date). The gzip header is not complex.
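
A sketch of the above, using Python's zlib bindings in place of direct deflateInit2()/deflate() calls purely to illustrate the byte layout; the chunk contents and compression level are arbitrary choices:

```python
import gzip
import struct
import zlib

def compress_chunk(chunk, is_last):
    # One independent raw-deflate stream per chunk; wbits=-15 is the
    # Python spelling of deflateInit2() with negative windowBits.
    co = zlib.compressobj(9, zlib.DEFLATED, -15)
    out = co.compress(chunk)
    # Z_SYNC_FLUSH ends the chunk on a byte boundary without setting the
    # final-block bit; only the last chunk gets Z_FINISH.
    out += co.flush(zlib.Z_FINISH if is_last else zlib.Z_SYNC_FLUSH)
    return out

chunks = [b"hello " * 1000, b"world " * 1000, b"again " * 1000]
body = b"".join(compress_chunk(c, i == len(chunks) - 1)
                for i, c in enumerate(chunks))

# Constant 10-byte gzip header (RFC 1952): magic 1f 8b, CM=8 (deflate),
# FLG=0, MTIME=0, XFL=0, OS=255 (unknown).
header = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff"
plain = b"".join(chunks)
trailer = struct.pack("<II", zlib.crc32(plain), len(plain) & 0xFFFFFFFF)

gz_file = header + body + trailer
# A stock single-threaded gunzip/inflate reads the result transparently:
assert gzip.decompress(gz_file) == plain
```

In a real multithreaded writer, each compress_chunk call would run in its own worker; only the concatenation order matters.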

You can also compute the CRC-32 of each uncompressed chunk in the same thread or a different thread, and combine those CRC-32's using crc32_combine(). You need that for the gzip trailer.
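
Python's zlib module does not expose crc32_combine(), so the sketch below substitutes an O(n) equivalent based on running zeros through the CRC register; zlib's C function computes the same result in O(log n) with a GF(2) matrix trick. The helper name mirrors the C API but the implementation here is only illustrative:

```python
import zlib

def crc32_combine(crc1, crc2, len2):
    # Stand-in for zlib's crc32_combine(): feeding len2 zero bytes with
    # crc1 as the starting value advances crc1 past the second block;
    # xoring out the zeros-only CRC and xoring in crc2 then yields the
    # CRC-32 of the concatenation. O(len2) here; the C API is O(log len2).
    zeros = b"\x00" * len2
    return zlib.crc32(zeros, crc1) ^ crc2 ^ zlib.crc32(zeros, 0)

part1, part2 = b"hello " * 100, b"world " * 100
combined = crc32_combine(zlib.crc32(part1), zlib.crc32(part2), len(part2))
assert combined == zlib.crc32(part1 + part2)
```

Each worker can checksum its own uncompressed chunk; the per-chunk values are folded together at the end in chunk order.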

After all of the compressed streams are written, ending with the stream that was ended with a Z_FINISH, you append the gzip trailer. That is just the four-byte CRC-32 and the low four bytes of the total uncompressed length, both in little-endian order. Eight bytes total.
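
For example, the eight-byte trailer can be packed like this (Python's struct is used for the little-endian layout; the payload is arbitrary):

```python
import struct
import zlib

data = b"example payload " * 1000
# gzip trailer: CRC-32 of all uncompressed input, then ISIZE (total
# uncompressed length modulo 2**32), both little-endian. Eight bytes.
trailer = struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
crc, isize = struct.unpack("<II", trailer)
assert len(trailer) == 8
assert crc == zlib.crc32(data) and isize == len(data)
```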

In each thread you can either use deflateEnd() when done with each chunk, or if you are reusing threads for more chunks, use deflateReset(). I found in pigz that it is much more efficient to leave threads open and deflate instances open in them when processing multiple chunks. Just make sure to use deflateEnd() for the last chunk that thread processes, before closing the thread. Yes, the error from deflateEnd() can be ignored. Just make sure that you've run deflate() until avail_out is not zero to get all of the compressed data.

Doing this, each thread compresses its chunk with no reference to any other uncompressed data, where such references would normally improve the compression when doing it serially. If you want to get more advanced, you can feed each thread the chunk of uncompressed data to compress, and the last 32K of the previous chunk to provide history for the compressor. You do this with deflateSetDictionary().
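
A minimal sketch of that dictionary handoff, again via Python's zlib bindings, where the zdict argument maps to the C API's deflateSetDictionary(); the chunk contents are arbitrary:

```python
import zlib

chunks = [b"abcdefgh" * 4000, b"abcdefgh" * 4000, b"abcdefgh" * 4000]
streams = []
for i, chunk in enumerate(chunks):
    last = i == len(chunks) - 1
    if i == 0:
        co = zlib.compressobj(9, zlib.DEFLATED, -15)
    else:
        # Prime this chunk's compressor with the previous chunk's last
        # 32K of *uncompressed* data (deflateSetDictionary() in C).
        co = zlib.compressobj(9, zlib.DEFLATED, -15,
                              zdict=chunks[i - 1][-32768:])
    streams.append(co.compress(chunk) +
                   co.flush(zlib.Z_FINISH if last else zlib.Z_SYNC_FLUSH))

# The single sequential inflate already holds the real 32K of history
# when it reaches each chunk, so the reader needs no dictionary at all:
assert zlib.decompressobj(-15).decompress(b"".join(streams)) == b"".join(chunks)
```

Note the asymmetry: only the compressors need the dictionary; a plain inflate of the concatenation sees an ordinary deflate stream.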

Still more advanced, you can reduce the number of bytes inserted between compressed streams by sometimes using Z_PARTIAL_FLUSH's until getting to a byte boundary. See pigz for the details of that.

Even more advanced, but slower, you can append compressed streams at the bit level instead of the byte level. That would require shifting every byte of the compressed stream twice to build a new shifted stream. At least for seven out of every eight preceding compressed streams. This eliminates all of the extra bits inserted between compressed streams.

A zlib stream can be generated in exactly the same way, using adler32_combine() for the checksum.
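
A sketch of the zlib-wrapped variant. The fixed 0x78 0x9C header assumes the default 32K window and no preset dictionary; per-chunk Adler-32 values could be merged with the C API's adler32_combine(), which Python does not expose, so the checksum here is simply computed over the joined input:

```python
import struct
import zlib

def raw_chunk(chunk, last):
    # Same independent raw-deflate streams as in the gzip case.
    co = zlib.compressobj(9, zlib.DEFLATED, -15)
    return co.compress(chunk) + co.flush(
        zlib.Z_FINISH if last else zlib.Z_SYNC_FLUSH)

chunks = [b"zlib " * 2000, b"stream " * 2000]
body = b"".join(raw_chunk(c, i == len(chunks) - 1)
                for i, c in enumerate(chunks))

plain = b"".join(chunks)
# 2-byte zlib header (RFC 1950): CMF=0x78 (deflate, 32K window),
# FLG=0x9C (check bits valid, no preset dictionary), then the raw
# deflate data, then a big-endian Adler-32 trailer.
zstream = b"\x78\x9c" + body + struct.pack(">I", zlib.adler32(plain))
assert zlib.decompress(zstream) == plain
```

Only six bytes of wrapping instead of gzip's eighteen, which is why the zlib format is the lighter choice inside container formats.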

Your question about zlib implies a confusion. The zip format does not use the zlib header and trailer. zip has its own structure, within which raw deflate streams are embedded. You can use the above approach for those raw deflate streams as well.

