压缩一组大整数 [英] Compressing a set of large integers

查看:98
本文介绍了压缩一组大整数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组整数,我希望它们具有最紧凑的表示形式。
我具有以下约束/特征:

I have a set of integers, for which I would like to have the most compact representation. I have the following constraints/features:


  • 已设置,换句话说,其中包含一个唯一整数的列表顺序无关紧要。

  • 集合L的大小相对较小(通常为1000个元素)

  • 整数遵循从0到0的均匀分布N-1,其中N相对较大(例如2 ^ 32)

  • 对压缩集的元素的访问是随机的,但是如果解压缩过程不太快,那就很好了

  • 压缩应该是无损的

  • it is set, or in other words, a list of unique integers in which the order does not matter
  • the size of the set L is relatively small (typically 1000 elements)
  • the integers follow a uniform distribution between 0 and N-1, with N relatively large (say 2^32)
  • the access to the elements of the compressed set are random, but it is fine if the decompression procedure is not so fast
  • the compression should be lossless, obviously

我已经尝试了一些方法,但是我对结果不满意,我以某种方式确信存在更好的解决方案:

I have tried a few things, but I am not satisfied by the results, and I am somehow convinced that a better solution exists:


  • 增量编码(排序,然后编码差异),或者还进行排序,然后对第i个元素与i * N / L之间的差异进行编码。两者都给出了合理的结果,但是却不是很好,这可能是因为N和L的典型大小。霍夫曼编码增量并没有帮助,因为它们通常很大。

  • 递归范围缩小( http://ygdes.com/ddj-3r/ddj-3r_compact.html ) 。这似乎很聪明,但是对指数递减的整数最有效,在这里绝对不是这种情况。

  • 这里关于stackoverflow的一些讨论是相似的,但并不完全等同于我的问题(用于压缩连续正整数的C库压缩排序的整数

  • delta encoding (sorting, then encoding differences), or also sorting, then encoding differences between the i-th element and i*N/L. Both give reasonable results, but not great, probably because of the typical sizes of N and L. Huffman coding the deltas does not help because they are typically big.
  • recursive range reduction ( http://ygdes.com/ddj-3r/ddj-3r_compact.html ). This seems smart, but works best on exponentially decreasing integers, which is definitely not the case here.
  • a few discussions here on stackoverflow are similar but not completely equivalent to my problem ( C Library for compressing sequential positive integers, Compress sorted integers )

我很高兴听听您可能有什么想法。

I would be glad to hear any ideas you might have. Thanks in advance!

更新:

事实证明,增量编码似乎近似最佳解决方案。对于集合中元素的其他其他分布而言,这可能会有所不同。

It turns out that delta encoding seems to approximate the optimal solution. This might be different for other other distributions of the elements in the set.

推荐答案

您可以尽力而为通过计数来做。 (我希望stackoverflow允许TeX方程,如math.stackexchange。无论如何...)

You can get an idea of the best you can do by counting. (I wish stackoverflow allowed TeX equations like math.stackexchange. Anyway ...)

ceiling(log(Combination(2^32,1000)) / (8 * log(2))) = 2934

例如,选择是均匀分布的,对于该特定情况,您平均希望获得的最佳压缩率是2934个字节。最佳比率是4000字节未编码表示形式的73.35%。

So if, as you say, the choices are uniformly distributed, the best compression you could hope for on average for that particular case is 2934 bytes. The best ratio is 73.35% of the unencoded representation of 4000 bytes.

Combination(2 ^ 32,1000)只是压缩算法可能输入的总数。如果它们均匀分布,则最佳编码是一个巨型整数,该巨型整数通过索引标识每个可能的输入。每个巨型整数值唯一地标识输入之一。想象一下在巨型表中通过索引查找输入。 ceiling(log(Combination(2 ^ 32,1000))/ log(2))是该索引整数所需的位数。

Combination(2^32,1000) is simply the total number of possible inputs to the compression algorithm. If those are uniformly distributed, then the optimal coding is one giant integer that identifies each possible input by an index. Each giant integer value uniquely identifies one of the inputs. Imagine looking up the input by index in a giant table. ceiling(log(Combination(2^32,1000)) / log(2)) is how many bits you need for that index integer.

更新:

我找到了一种使用现成的压缩方法来接近理论最佳值的方法工具。我进行排序,应用增量编码,并从中减去一个(因为连续的不同元素之间的增量至少是一个)。然后的诀窍是,我先写出所有高字节,然后写出下一个最高有效字节,等等。增量减一的高字节趋向于零,以便将许多零组合在一起,这是标准压缩实用程序喜欢的。同样,下一组字节也倾向于偏向较低的值。

I found a way to get close to the theoretical best using off-the-shelf compression tools. I sort, apply delta coding, and subtract one from that (since the delta between successive distinct elements is at least one). Then the trick is that I write out all the high bytes, then the next most significant bytes, etc. The high bytes of the deltas minus one tend to be zero, so that groups a lot of zeros together, which the standard compression utilities love. Also the next set of bytes tend to be biased to low values.

对于示例(从0..2 ^ 32-1开始的1000个均匀且不同的样本),I通过 gzip -9 运行时平均获得3110字节,通过 xz -9 获得3098字节(xz使用与7zip相同的压缩LZMA)。它们非常接近2934的理论最佳平均值。另外,对于标头和尾标,gzip的开销为18字节,xz的开销为24字节。因此,与理论上最好的比较, gzip -9 为3092, xz -9 为3074。比理论上的最佳值大约5%。

For the example (1000 uniform and distinct samples from 0..2^32-1), I get an average of 3110 bytes when running that through gzip -9, and 3098 bytes through xz -9 (xz uses the same compression, LZMA, as 7zip). Those are pretty close to the theoretical best average of 2934. Also gzip has an overhead of 18 bytes, and xz has an overhead of 24 bytes, both for headers and trailers. So a fairer comparison with the theoretical best would be 3092 for gzip -9 and 3074 for xz -9. Around 5% larger than the theoretical best.

更新2:

I实现了排列的直接编码,平均达到2974字节,仅比理论上的最佳值高出1%多一点。我使用了 GNU多精度算术库,以一个巨型整数对每个排列的索引进行编码。编码和解码的实际代码如下所示。我在 mpz _ * 函数中添加了注释,但从名称上可能看不出来它们在执行什么算术运算。

I implemented direct encoding of the permutations, and achieved an average of 2974 bytes, which is only a little over 1% more than the theoretical best. I used the GNU multiple precision arithmetic library to encode an index for each permutation in a giant integer. The actual code for the encoding and decoding is shown below. I added comments for the mpz_* functions where it may not be obvious from the name what arithmetic operations they're doing.

/* Recursively code the members in set[] between low and high (low and high
   themselves have already been coded).  First code the middle member 'mid'.
   Then recursively code the members between low and mid, and then between mid
   and high. */
local void combination_encode_between(mpz_t pack, mpz_t base,
                                      const unsigned long *set,
                                      int low, int high)
{
    int mid;

    /* compute the middle position -- if there is nothing between low and high,
       then return immediately (also in that case, verify that set[] is sorted
       in ascending order) */
    mid = (low + high) >> 1;
    if (mid == low) {
        assert(set[low] < set[high]);
        return;
    }

    /* code set[mid] into pack, and update base with the number of possible
       set[mid] values between set[low] and set[high] for the next coded
       member */
        /* pack += base * (set[mid] - set[low] - 1) */
    mpz_addmul_ui(pack, base, set[mid] - set[low] - 1);
        /* base *= set[high] - set[low] - 1 */
    mpz_mul_ui(base, base, set[high] - set[low] - 1);

    /* code the rest between low and high */
    combination_encode_between(pack, base, set, low, mid);
    combination_encode_between(pack, base, set, mid, high);
}

/* Encode the set of integers set[0..num-1], where each element is a unique
   integer in the range 0..max.  No value appears more than once in set[]
   (hence the name "set").  The elements of set[] must be sorted in ascending
   order. */
local void combination_encode(mpz_t pack, const unsigned long *set, int num,
                              unsigned long max)
{
    mpz_t base;

    /* handle degenerate cases and verify last member <= max -- code set[0]
       into pack as simply itself and set base to the number of possible set[0]
       values for coding the next member */
    if (num < 1) {
            /* pack = 0 */
        mpz_set_ui(pack, 0);
        return;
    }
        /* pack = set[0] */
    mpz_set_ui(pack, set[0]);
    if (num < 2) {
        assert(set[0] <= max);
        return;
    }
    assert(set[num - 1] <= max);
        /* base = max - num + 2 */
    mpz_init_set_ui(base, max - num + 2);

    /* code the last member of the set and update base with the number of
       possible last member values */
        /* pack += base * (set[num - 1] - set[0] - 1) */
    mpz_addmul_ui(pack, base, set[num - 1] - set[0] - 1);
        /* base *= max - set[0] */
    mpz_mul_ui(base, base, max - set[0]);

    /* encode the members between 0 and num - 1 */
    combination_encode_between(pack, base, set, 0, num - 1);
    mpz_clear(base);
}

/* Recursively decode the members in set[] between low and high (low and high
   themselves have already been decoded).  First decode the middle member
   'mid'. Then recursively decode the members between low and mid, and then
   between mid and high. */
local void combination_decode_between(mpz_t unpack, unsigned long *set,
                                      int low, int high)
{
    int mid;
    unsigned long rem;

    /* compute the middle position -- if there is nothing between low and high,
       then return immediately */
    mid = (low + high) >> 1;
    if (mid == low)
        return;

    /* extract set[mid] as the remainder of dividing unpack by the number of
       possible set[mid] values, update unpack with the quotient */
        /* div = set[high] - set[low] - 1, rem = unpack % div, unpack /= div */
    rem = mpz_fdiv_q_ui(unpack, unpack, set[high] - set[low] - 1);
    set[mid] = set[low] + 1 + rem;

    /* decode the rest between low and high */
    combination_decode_between(unpack, set, low, mid);
    combination_decode_between(unpack, set, mid, high);
}

/* Decode from pack the set of integers encoded by combination_encode(),
   putting the result in set[0..num-1].  max must be the same value used when
   encoding. */
local void combination_decode(const mpz_t pack, unsigned long *set, int num,
                              unsigned long max)
{
    mpz_t unpack;
    unsigned long rem;

    /* handle degnerate cases, returning the value of pack as the only element
       for num == 1 */
    if (num < 1)
        return;
    if (num < 2) {
            /* set[0] = (unsigned long)pack */
        set[0] = mpz_get_ui(pack);
        return;
    }

    /* extract set[0] as the remainder after dividing pack by the number of
       possible set[0] values, set unpack to the quotient */
    mpz_init(unpack);
        /* div = max - num + 2, set[0] = pack % div, unpack = pack / div */
    set[0] = mpz_fdiv_q_ui(unpack, pack, max - num + 2);

    /* extract the last member as the remainder after dividing by the number
       of possible values, taking into account the first member -- update
       unpack with the quotient */
        /* rem = unpack % max - set[0], unpack /= max - set[0] */
    rem = mpz_fdiv_q_ui(unpack, unpack, max - set[0]);
    set[num - 1] = set[0] + 1 + rem;

    /* decode the members between 0 and num - 1 */
    combination_decode_between(unpack, set, 0, num - 1);
    mpz_clear(unpack);
}

mpz _ * 函数,用于将数字写入文件并回读,或将数字导出为内存中的指定格式,然后再导回。

There are mpz_* functions for writing the number to a file and reading it back, or exporting the number to a specified format in memory, and importing it back.

这篇关于压缩一组大整数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆