压缩稀疏位数组 [英] Compressing a sparse bit array

查看:76
本文介绍了压缩稀疏位数组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有1024个字节(8192个位)的数组,其中大多数都是零。

I have arrays of 1024 bytes (8192 bits) which are mostly zero.

将设置0.01%到10%的位(随机,无模式)

Between 0.01% and 10% of bits will be set (random, no pattern).

鉴于缺乏结构且尺寸相对较小,如何压缩这些文件?

How could these be compressed, given the lack of structure and the relatively small size?

(我的首先想到的是存储设置位之间的距离,每个距离我需要13位,但是在最坏的情况下,占用10%的空间需要 13 * 816/8 = 1326 个字节,这没有改善。)

(My first thought was to store the distances between set bits. I need 13 bits for each distance, but at worst case 10% occupancy this needs 13 * 816 / 8 = 1326 bytes, which is not an improvement.)

这是针对超低带宽通信的,因此每个字节都很重要。

This is for ultra-low bandwidth comms, so every byte matters.

推荐答案

我已经处理了类似的问题,但是我的集合更大(3000万个可能的值,介于1到30之间每个集合中有100万个元素),因此它们都从压缩中获得了更多收益,并且压缩元数据与数据大小相比微不足道。我从来没有考虑过将内容压缩为小于 uint16_t 的单位,因此,如果您开始将13位值切成小段,则我下面编写的内容可能不适用。

I've dealt deeply with a similar problem, but my sets are much bigger (30 million possible values with between 1 and 30 million elements in each set), so they both gain much more from compression and the compression metadata is insignificant compared to the size of the data. I have never gone down to squeezing things into units smaller than uint16_t, so the things I write below might not apply if you start chopping up 13 bit values into pieces. It feels like it should work, but caveat emptor.

我发现行之有效的是采用几种策略,这些策略取决于我们拥有的特定数据。好消息是,每个集合中元素的数量很好地指示了哪种压缩策略最适合特定集合。因此,您需要的所有元数据都是集合中元素的数量。在我的数据格式中,第一个也是唯一的元数据值(我将不明确,仅将其称为值,您可以按字节,16位值或13位值来压缩内容)是集合中元素的数量,剩下的只是集合元素的编码。

What I've found works is to employ several strategies that depend on the particular data we have. The good news is that the count of elements in each set is a very good indicator of which compression strategy will work best for a particular set. So all the metadata you need is a count of elements in the set. In my data format the first and only metadata value (I'll be unspecific and just call it "value", you can squeeze things in bytes, 16 bit values or 13 bit values however you feel) is the count of elements in the set, the rest is just the encoding of the set elements.

策略是:


  1. 如果集合中的元素很少,那么您做不到比表示 1,4711,8140的数组做得更好,因此在这种情况下,数据编码为:[3,1,4711 ,8140]

  1. If very few elements are in the set, you can't do better than an array that says "1, 4711, 8140", so in this case the data is encoded as: [3, 1, 4711, 8140]

如果几乎​​所有元素都在集合中,则只需跟踪不存在的元素即可。例如[8190,17,42]。

If almost all elements are in the set, you can just keep track of elements that aren't. For example [8190, 17, 42].

如果集合中大约有一半的元素,那么您几乎不能比位图做得更好,因此您得到[4000,{bitmap}],这是唯一的情况是您的数据最终要比严格未压缩的数据更长。

If around half of the elements are in the set you pretty much can't do much better than a bitmap, so you get [4000, {bitmap}], this is the only case where your data ends up being longer than strictly uncompressed.

如果大于设置了一些但少于一半左右的元素,我发现了另一种策略。将您可能的值的位数分成两半。假设我们有2 ^ 16个(可能更容易描述,它可能适用于2 ^ 13个)可能的值。这些值分为256个范围,每个范围有256个可能的值。然后,我们有了一个包含256个字节的数组,这些字节中的每个字节描述了每个数组后立即有多少个值(因此字节0告诉我们有多少元素[0,255],字节1给我们[256,511]等)。其值在每个范围mod 256中。这里的技巧是,虽然集合中每个元素编码为数组(策略1)将为2个字节,但是在此方案中,每个元素的计数仅为1个字节+ 256个静态字节。元素。这意味着一旦集合中有超过256个元素,就可以通过从策略1切换到4来节省空间。

If more than "a few" but many fewer than "around half" elements are set, I found another strategy. Divide the bits of your possible values in the set in half. Let's say we have 2^16 (it's easier to describe, it should probably work for 2^13) possible values. The values are divided into 256 ranges with each range with 256 possible values. We then have an array with 256 bytes, each of these bytes describes how many values are in each range (so byte 0 tells us how many elements are [0,255], byte 1 gives us [256,511], etc.) immediately after follow arrays with the values in each range mod 256. The trick here is that while every element in the set encoded as an array (strategy 1) would be 2 bytes, in this scheme each element is only 1 bytes + 256 static bytes for the counts of elements. This means that as soon as we have more than 256 elements in the set this saves us space by switching from strategy 1 to 4.

策略4可以得到改进(如果您所说的数据是随机的,则可能毫无意义,但有时我的数据具有更多模式,因此对我有用。)由于在以前的编码中每个元素仍然需要8位,因此一旦元素的子数组超过32个元素(256字节),我们就可以将其存储为位图。对于在4/5到3之间切换策略来说,这也是一个很好的断点。如果此策略中的所有数组都只是位图,那么我们应该只使用策略3(比这复杂得多,但是可以很容易地预先计算策略之间的断点准确地说,您每次都会选择最有可能的有效策略。)

Strategy 4 can be refined (probably meaningless if your data is random as you mention, but my data had more patterns sometimes, so it worked for me). Since we still need 8 bits for each element in the previous encoding, as soon as a sub-array of elements goes over 32 elements (256 bytes), we can store it as a bitmap instead. This is also a good breakpoint for switching strategies between 4/5 to 3. If all the arrays in this strategy are just bitmaps, then we should just use strategy 3 (it's more complicated than that, but the breakpoint between strategies can be precomputed quite accurately that you'll end up picking the most likely efficient strategy each time).

我只是模糊地尝试了保存集合中数字之间的差额。快速实验表明,它们并没有比我上面提到的策略有效得多,具有不可预测的退化案例,但是最重要的是,我使用的应用程序确实喜欢不必对数据进行反序列化,而直接从磁盘原始使用(mmap)。

I have only vaguely tried saving deltas between numbers in the set. Quick experiments showed that they weren't really much more efficient than the strategies I mentioned above, had unpredictable degenerate cases, but most importantly, the application I work with really likes to not have to deserialise its data, just use it raw straight from disk (mmap).

这篇关于压缩稀疏位数组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆