压缩唯一的数据流 [英] Compression for a unique stream of data
问题描述
我有大量的整数数组。每个都有几千个整数,每个整数通常与前一个相同,或者只相差一个位或两个。我想把每个数组缩小到尽可能小,以减少我的磁盘IO。
I've got a large number of integer arrays. Each one has a few thousand integers in it, and each integer is generally the same as the one before it or is different by only a single bit or two. I'd like to shrink each array down as small as possible to reduce my disk IO.
Zlib将其缩小到原始大小的25%。这很好,但我不认为它的算法是特别适合的问题。有没有人知道一个压缩库或简单的算法,可能执行更好的这种类型的信息?
Zlib shrinks it to about 25% of its original size. That's nice, but I don't think its algorithm is particularly well suited for the problem. Does anyone know a compression library or simple algorithm that might perform better for this type of information?
更新:zlib将其转换为一个数组xor deltas收缩到约20%的原始尺寸。
Update: zlib after converting it to an array of xor deltas shrinks it to about 20% of the original size.
推荐答案
如果大多数整数真的与前一个相同,并且符号间的差异通常可以表示为
If most of the integers really are the same as the previous, and the inter-symbol difference can usually be expressed as a single bit flip, this sounds like a job for XOR.
获取输入流如下:
1101
1101
1110
1110
0110
并输出:
1101
0000
0010
0000
1000
一点伪代码
compressed[0] = uncompressed[0]
loop
compressed[i] = uncompressed[i-1] ^ uncompressed[i]
我们现在已经将大部分输出减少为0,即使更改了高位。在您使用的任何其他工具中的RLE压缩将有一个字段的一天。它将在32位整数上工作更好,它仍然可以编码一个完全不同的整数在流中弹出。
We've now reduced most of the output to 0, even when a high bit is changed. The RLE compression in any other tool you use will have a field day with this. It'll work even better on 32-bit integers, and it can still encode a radically different integer popping up in the stream. You're saved the bother of dealing with bit-packing yourself, as everything remains an int-sized quantity.
当您想要解压缩时:
uncompressed[0] = compressed[0]
loop
uncompressed[i] = uncompressed[i-1] ^ compressed[i]
这也有一个简单的算法, ,因为它只是XOR。
This also has the advantage of being a simple algorithm that is going to run really, really fast, since it is just XOR.
这篇关于压缩唯一的数据流的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!