Wrap deflated data in gzip format

Question

I think I'm missing something very simple. I have a byte array holding deflated data, written into it using a Deflater:

deflate(outData, 0, BLOCK_SIZE, SYNC_FLUSH)

The reason I didn't just use GZIPOutputStream is that there are 4 threads (the number is variable), each given a block of data, and each thread compresses its own block before storing the compressed data into a global byte array. If I used GZIPOutputStream, it would mess up the format, because each little block would get a header and trailer and become its own gzip stream (I only want to compress it).

So in the end, I've got this byte array, outData, that holds all of my compressed data, but I'm not really sure how to wrap it. GZIPOutputStream writes from a buffer of uncompressed data, but this array is all set. It's already compressed, and I'm hitting a wall trying to figure out how to get it into gzip form.

EDIT: OK, bad wording on my part. I'm writing it to output, not a file, so that it can be redirected if needed. A really simple example is that

cat file.txt | java Jzip | gzip -d | cmp file.txt

should return 0. The problem right now is that if I write this byte array to output as is, it's just "raw" compressed data. I think gzip needs all this extra information.

If there's an alternative method, that would be fine too. The whole reason it's like this is that I need to use multiple threads. Otherwise I would just call GZIPOutputStream.

DOUBLE EDIT: Since the comments provided a lot of good insight, here is another method. I have a bunch of uncompressed blocks of data that were originally one long stream. If gzip can read concatenated streams, could I take those blocks (keeping them in order), give each one to a thread that calls GZIPOutputStream on its own block, and then concatenate the results? In essence, each block would then have a header, the compressed data, and a trailer. Would gzip recognize that if I concatenated them?

Example:

cat file.txt
Hello world! How are you? I'm ready to set fire to this assignment.

java Testcase < file.txt > file.txt.gz

So I accept it from input. Inside the program, the stream is split up into "Hello world!", "How are you?", and "I'm ready to set fire to this assignment" (they're not strings, just arrays of bytes; this is only an illustration).

So I've got these three blocks of bytes, all uncompressed. I give each of these blocks to a thread, which uses

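public static class DGZIPOutputStream extends GZIPOutputStream
{
    // flush == true makes flush() perform a SYNC_FLUSH on the underlying Deflater
    public DGZIPOutputStream(OutputStream out, boolean flush) throws IOException
    {
        super(out, flush);
    }
    // def is the protected Deflater inherited from DeflaterOutputStream
    public void setDictionary(byte[] b)
    {
        def.setDictionary(b);
    }
    // crc is the protected CRC32 field of GZIPOutputStream
    public void updateCRC(byte[] input)
    {
        crc.update(input);
    }
}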

As you can see, the only things here are that I've set the flush to SYNC_FLUSH so I can get the alignment right, and that I can set the dictionary. If each thread used DGZIPOutputStream (which I've tested, and it works for one long continuous input), and I concatenated those three blocks (each now compressed with its own header and trailer), would gzip -d file.txt.gz work?

If that's too weird, ignore the dictionary completely. It doesn't really matter. I just added it in while I was at it.

Recommended answer

If you set nowrap to true when using the Deflater (sic) constructor, then the result is raw deflate. Otherwise it's zlib, and you would have to strip the zlib header and trailer. For the rest of the answer, I am assuming nowrap is true.
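To make the difference concrete, here is a minimal sketch that compresses the same bytes with and without nowrap. The class and method names (NowrapDemo, compress) are made up for illustration; only the java.util.zip.Deflater calls come from the standard API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class NowrapDemo {
    // Compress input in one shot; nowrap = true gives raw deflate, false gives zlib.
    public static byte[] compress(byte[] input, boolean nowrap) {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
        def.setInput(input);
        def.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!def.finished()) {
            int n = def.deflate(buf);
            out.write(buf, 0, n);
        }
        def.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] input = "Hello world! How are you?".getBytes(StandardCharsets.UTF_8);
        byte[] raw = compress(input, true);
        byte[] zlib = compress(input, false);
        // The zlib form carries a 2-byte header and a 4-byte Adler-32 trailer.
        System.out.println("raw deflate: " + raw.length + " bytes");
        System.out.println("zlib:        " + zlib.length + " bytes");
        System.out.printf("zlib first byte: 0x%02x%n", zlib[0] & 0xff);
    }
}
```

With nowrap set to false you would have to drop the first two and last four bytes before wrapping the data as gzip, which is why nowrap = true is the simpler starting point.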

To wrap a complete, terminated deflate stream as a gzip stream, you need to prepend ten bytes:

"\x1f\x8b\x08\0\0\0\0\0\0\xff"

(Sorry, that's C format; you'll need to convert it to Java octal.) You also need to append the four-byte CRC in little-endian order, followed by the four-byte total uncompressed length modulo 2^32, also in little-endian order. Given what is available in the standard Java API, you'll need to compute the CRC serially; it can't be done in parallel. zlib does have a function (crc32_combine) to combine separate CRCs that were computed in parallel, but it is not exposed in Java.
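The header and trailer described above could be sketched in Java roughly as follows. GzipWrap and its methods are hypothetical names; the sketch assumes the whole uncompressed input is still available as one array so the CRC can be computed serially, and it uses GZIPInputStream only to check the result.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;

public class GzipWrap {
    // Magic 1f 8b, method 8 (deflate), no flags, zero mtime, no extra flags, OS 255 (unknown).
    static final byte[] HEADER = {0x1f, (byte) 0x8b, 0x08, 0, 0, 0, 0, 0, 0, (byte) 0xff};

    // Write the low 32 bits of v, least significant byte first.
    static void writeIntLE(ByteArrayOutputStream out, long v) {
        for (int shift = 0; shift < 32; shift += 8) {
            out.write((int) (v >>> shift) & 0xff);
        }
    }

    // Wrap an already terminated raw deflate stream as a single gzip member.
    public static byte[] wrap(byte[] rawDeflate, byte[] uncompressed) {
        CRC32 crc = new CRC32();
        crc.update(uncompressed); // computed serially over the original data
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(HEADER, 0, HEADER.length);
        out.write(rawDeflate, 0, rawDeflate.length);
        writeIntLE(out, crc.getValue());
        writeIntLE(out, uncompressed.length); // length modulo 2^32
        return out.toByteArray();
    }

    // Helper for the demo: one-shot raw deflate (nowrap = true).
    public static byte[] rawDeflate(byte[] input) {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        def.setInput(input);
        def.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!def.finished()) {
            out.write(buf, 0, def.deflate(buf));
        }
        def.end();
        return out.toByteArray();
    }

    // Helper for the demo: decode with the standard gzip reader.
    public static byte[] gunzip(byte[] gz) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "Hello world! How are you?".getBytes(StandardCharsets.UTF_8);
        byte[] gz = wrap(rawDeflate(original), original);
        System.out.println("round trip ok: " + Arrays.equals(original, gunzip(gz)));
    }
}
```

In a multithreaded program, the CRC could be updated block by block in input order on a single thread while the worker threads compress.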

Note that I said a complete, terminated deflate stream. It takes some care to make one of those with parallel deflate tasks. You would need to make n-1 unterminated deflate streams and one final terminated deflate stream, and concatenate them. The last one is made normally. The other n-1 need to be ended with a sync flush, so that each ends on a byte boundary and is not marked as the end of the stream. To do that, use deflate with the flush parameter SYNC_FLUSH. Don't use finish() on those.
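A rough sketch of that chunking scheme follows. ChunkedDeflate, deflateChunk and inflateRaw are illustrative names, not part of any library. The chunks are compressed in a plain loop here, but each deflateChunk call is independent and could run on its own thread as long as the outputs are concatenated in order.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ChunkedDeflate {
    // Compress one chunk as raw deflate. Non-final chunks end with a sync flush
    // (byte-aligned, not marked as end of stream); only the final chunk calls finish().
    public static byte[] deflateChunk(byte[] chunk, boolean last) {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // nowrap = true
        def.setInput(chunk);
        if (last) {
            def.finish();
        }
        int flush = last ? Deflater.NO_FLUSH : Deflater.SYNC_FLUSH;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        do {
            n = def.deflate(buf, 0, buf.length, flush);
            out.write(buf, 0, n);
            // A full buffer means there may be more pending output for this flush.
        } while (last ? !def.finished() : n == buf.length);
        def.end();
        return out.toByteArray();
    }

    // Inflate a raw deflate stream; used here only to check the concatenation.
    public static byte[] inflateRaw(byte[] data) throws DataFormatException {
        Inflater inf = new Inflater(true);
        // The Inflater documentation asks for one extra dummy byte in nowrap mode.
        inf.setInput(Arrays.copyOf(data, data.length + 1));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!inf.finished()) {
            int n = inf.inflate(buf);
            if (n == 0 && (inf.needsInput() || inf.needsDictionary())) {
                throw new DataFormatException("stream is truncated or needs a dictionary");
            }
            out.write(buf, 0, n);
        }
        inf.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        String[] blocks = {"Hello world! ", "How are you? ", "I'm ready to set fire to this assignment."};
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        for (int i = 0; i < blocks.length; i++) {
            byte[] c = deflateChunk(blocks[i].getBytes(StandardCharsets.UTF_8), i == blocks.length - 1);
            joined.write(c, 0, c.length);
        }
        System.out.println(new String(inflateRaw(joined.toByteArray()), StandardCharsets.UTF_8));
    }
}
```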

For better compression, you can use setDictionary on each chunk with the last 32K of the previous chunk.
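One way this might look, as a sketch only: each chunk's Deflater is primed with up to 32K of the uncompressed bytes that precede it. DictChunks and its methods are hypothetical names. Because the dictionary is exactly the data the decompressor has just produced, the concatenated raw stream inflates without any dictionary being set on the reading side.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DictChunks {
    static final int WINDOW = 32 * 1024; // deflate can look back at most 32K

    // Compress data[start, end) as raw deflate, primed with the bytes just before start.
    public static byte[] deflateChunk(byte[] data, int start, int end, boolean last) {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        if (start > 0) {
            int dictStart = Math.max(0, start - WINDOW);
            def.setDictionary(data, dictStart, start - dictStart);
        }
        def.setInput(data, start, end - start);
        if (last) {
            def.finish();
        }
        int flush = last ? Deflater.NO_FLUSH : Deflater.SYNC_FLUSH;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        do {
            n = def.deflate(buf, 0, buf.length, flush);
            out.write(buf, 0, n);
        } while (last ? !def.finished() : n == buf.length);
        def.end();
        return out.toByteArray();
    }

    // Split data into chunks, compress each one, and concatenate the results in order.
    public static byte[] compressAll(byte[] data, int chunkSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int start = 0;
        do {
            int end = Math.min(data.length, start + chunkSize);
            byte[] c = deflateChunk(data, start, end, end == data.length);
            out.write(c, 0, c.length);
            start = end;
        } while (start < data.length);
        return out.toByteArray();
    }

    // Plain raw inflate with no dictionary; used only to check the result.
    public static byte[] inflateRaw(byte[] data) throws DataFormatException {
        Inflater inf = new Inflater(true);
        // The Inflater documentation asks for one extra dummy byte in nowrap mode.
        inf.setInput(Arrays.copyOf(data, data.length + 1));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!inf.finished()) {
            int n = inf.inflate(buf);
            if (n == 0 && (inf.needsInput() || inf.needsDictionary())) {
                throw new DataFormatException("stream is truncated or needs a dictionary");
            }
            out.write(buf, 0, n);
        }
        inf.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            sb.append("line ").append(i).append(" of some repetitive text\n");
        }
        byte[] data = sb.toString().getBytes(java.nio.charset.StandardCharsets.UTF_8);
        byte[] packed = compressAll(data, 40000);
        System.out.println("round trip ok: " + Arrays.equals(data, inflateRaw(packed)));
    }
}
```

The trade-off is that each worker needs read access to the tail of the previous block's uncompressed input, which is easy when all blocks come from one shared input array.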
