压缩唯一的数据流 [英] Compression for a unique stream of data

查看:102
本文介绍了压缩唯一的数据流的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有大量的整数数组。每个都有几千个整数,每个整数通常与前一个相同,或者只相差一个位或两个。我想把每个数组缩小到尽可能小,以减少我的磁盘IO。

I've got a large number of integer arrays. Each one has a few thousand integers in it, and each integer is generally the same as the one before it or is different by only a single bit or two. I'd like to shrink each array down as small as possible to reduce my disk IO.

Zlib将其缩小到原始大小的25%。这很好,但我不认为它的算法是特别适合的问题。有没有人知道一个压缩库或简单的算法,可能执行更好的这种类型的信息?

Zlib shrinks it to about 25% of its original size. That's nice, but I don't think its algorithm is particularly well suited for the problem. Does anyone know a compression library or simple algorithm that might perform better for this type of information?

更新:zlib将其转换为一个数组xor deltas收缩到约20%的原始尺寸。

Update: zlib after converting it to an array of xor deltas shrinks it to about 20% of the original size.

推荐答案

如果大多数整数真的与前一个相同,并且符号间的差异通常可以表示为

If most of the integers really are the same as the previous, and the inter-symbol difference can usually be expressed as a single bit flip, this sounds like a job for XOR.

获取输入流如下:

1101
1101
1110
1110
0110

并输出:

1101
0000
0010
0000
1000

一点伪代码

compressed[0] = uncompressed[0]
loop
  compressed[i] = uncompressed[i-1] ^ uncompressed[i]

我们现在已经将大部分输出减少为0,即使更改了高位。在您使用的任何其他工具中的RLE压缩将有一个字段的一天。它将在32位整数上工作更好,它仍然可以编码一个完全不同的整数在流中弹出。

We've now reduced most of the output to 0, even when a high bit is changed. The RLE compression in any other tool you use will have a field day with this. It'll work even better on 32-bit integers, and it can still encode a radically different integer popping up in the stream. You're saved the bother of dealing with bit-packing yourself, as everything remains an int-sized quantity.

当您想要解压缩时:

uncompressed[0] = compressed[0]
loop
  uncompressed[i] = uncompressed[i-1] ^ compressed[i]

这也有一个简单的算法, ,因为它只是XOR。

This also has the advantage of being a simple algorithm that is going to run really, really fast, since it is just XOR.

这篇关于压缩唯一的数据流的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆