Compress large Integers into smallest possible string


Question

I have a bunch of 10 digit integers that I'm passing in a URL. Something like: "4294965286", "2292964213". They will always be positive and always be 10 digits.

I'd like to compress those integers into the smallest possible form that can still be used in a URL (aka letters and numbers are perfectly fine) and then uncompress them later. I've looked at using gzipstream but it creates larger strings, not shorter.

I'm currently using ASP.NET, so a VB.NET or C# solution would be best.

Thanks

Solution

Yes. GZIP is a compression algorithm which both requires compressible data and has an overhead (framing and dictionaries, etc). An encoding algorithm should be used instead.

The "simple" method is to use base-64 encoding.

That is, convert the number (which is represented in base 10 in the string) to the actual series of bytes that represent the number (5 bytes will cover a 10-digit decimal number) and then base-64 encode that result. Each base-64 character stores 6 bits of information (compared with ~3.3 bits/character for decimal digits), so the output is just over half the original size (in this case, 6* base-64 output characters are required).

Additionally, since the input/output lengths are obtainable from the data itself, "123" might be originally (before being base-64 encoded) converted as 1 byte, "30000" as 2 bytes, etc. This would be advantageous if not all the numbers are approximately the same length.
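
As a rough illustration of the variable-length idea above, the sketch below drops leading zero bytes before base-64 encoding and strips the "=" padding. The class name `CompactBase64` and its two methods are hypothetical names chosen for this example; only `BitConverter` and `Convert` come from .NET. The output still uses the standard base-64 alphabet, which is not URL-safe.

```csharp
using System;

static class CompactBase64
{
    // Encodes a non-negative value using as few bytes as it needs.
    public static string Encode(long value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        byte[] data = BitConverter.GetBytes(value);
        // Work big-endian so the zero bytes to drop are at the front.
        if (BitConverter.IsLittleEndian) Array.Reverse(data);
        int start = 0;
        // Skip leading zero bytes, but always keep at least one byte.
        while (start < data.Length - 1 && data[start] == 0) start++;
        // The "=" padding carries no information, so it is stripped.
        return Convert.ToBase64String(data, start, data.Length - start).TrimEnd('=');
    }

    public static long Decode(string encoded)
    {
        // Restore the padding that Encode removed.
        int paddedLength = (encoded.Length + 3) / 4 * 4;
        byte[] trimmed = Convert.FromBase64String(encoded.PadRight(paddedLength, '='));
        if (trimmed.Length > 8) throw new ArgumentException("too long for a 64-bit value", nameof(encoded));
        // Put the bytes back at the low-order end of an 8-byte buffer.
        byte[] data = new byte[8];
        trimmed.CopyTo(data, 8 - trimmed.Length);
        if (BitConverter.IsLittleEndian) Array.Reverse(data);
        return BitConverter.ToInt64(data, 0);
    }
}
```

With this sketch, 123 becomes "ew" and 30000 becomes "dTA", while a value that needs all 5 bytes, such as 9999999999, comes out as 7 characters.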

Happy coding.


* Using base-64 requires 6 output characters.

Edit: I was wrong initially where I said "2.3 bits/char" for decimal and proposed that less than half the characters were required. I have updated the answer above and show the (should be correct) math here, where lg(n) is log to the base 2.

The number of input bits required to represent the input number is bits/char * chars -> lg(10) * 10 (or just lg(9999999999)) -> ~33.2 bits. Using jball's manipulation to shift the number first, the number of bits required is lg(8999999999) -> ~33.06 bits. However this transformation isn't able to increase the efficiency in this particular case (the number of input bits would need to be reduced to 30 or below to make a difference here).

So we try to find an x (number of characters in base-64 encoding) such that:

lg(64) * x = 33.2 -> 6 * x = 33.2 -> x ~ 5.53. Of course five and a half characters is nonsensical, so we choose 6 as the maximum number of characters required to encode a value up to 9999999999 in base-64 encoding. This is slightly more than half of the original 10 characters.
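
The same count can be checked with integer arithmetic instead of logarithms. The helper below is only a sketch; `EncodedLength.CharsNeeded` is a hypothetical name introduced here, and it simply counts how many digits the largest value has in a given radix.

```csharp
using System;

static class EncodedLength
{
    // Returns the smallest n such that radix^n > maxValue, i.e. the number
    // of characters needed to represent every value in 0..maxValue.
    // Integer division avoids floating-point rounding near the boundaries.
    public static int CharsNeeded(long maxValue, int radix)
    {
        if (maxValue < 0) throw new ArgumentOutOfRangeException(nameof(maxValue));
        if (radix < 2) throw new ArgumentOutOfRangeException(nameof(radix));
        int chars = 1;
        long remaining = maxValue / radix;
        while (remaining > 0)
        {
            remaining /= radix;
            chars++;
        }
        return chars;
    }
}
```

`EncodedLength.CharsNeeded(9999999999L, 64)` returns 6, and the same call with a radix of 10 returns the original 10.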

However, it should be noted that to obtain only 6 characters in base-64 output requires a non-standard base-64 encoder or a little bit of manipulation (most base-64 encoders only work on whole bytes). This works because out of the original 5 "required bytes" only 34 of the 40 bits are used (the top 6 bits are always 0). It would require 7 base-64 characters to encode all 40 bits.

Here is a modification of the code that Guffa posted in his answer (if you like it, go give him an up-vote) that only requires 6 base-64 characters. Please see other notes in Guffa's answer and Base64 for URL applications as the method below does not use a URL-friendly mapping.

long value = 4294965286; // example input; the original snippet assumes a long named value
byte[] data = BitConverter.GetBytes(value);
// make data big-endian if needed
if (BitConverter.IsLittleEndian) {
   Array.Reverse(data);
}
// the first 5 base-64 characters are always "A" (as the first 30 bits are always zero)
// only need to keep the next 6 characters (36 bits); the 12th character is "=" padding
string base64 = Convert.ToBase64String(data, 0, 8).Substring(5,6);

byte[] data2 = new byte[8];
// add back in all the characters removed during encoding
Convert.FromBase64String("AAAAA" + base64 + "=").CopyTo(data2, 0);
// reverse again from big to little-endian
if (BitConverter.IsLittleEndian) {
   Array.Reverse(data2);
}
long decoded = BitConverter.ToInt64(data2, 0);
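
Because the standard base-64 alphabet includes "+" and "/", the 6 characters produced above may need escaping in a URL. One possible workaround, sketched below, is to swap those two characters for "-" and "_" as the URL-safe alphabet of RFC 4648 does. The class name `UrlSafeSix` and the range check are assumptions added for this example and are not part of Guffa's code or of .NET.

```csharp
using System;

static class UrlSafeSix
{
    // Encodes a value in 0..2^34-1 as exactly 6 URL-safe characters.
    public static string Encode(long value)
    {
        // Only values whose top 30 bits are zero survive the 5-character trim.
        if (value < 0 || value >= (1L << 34)) throw new ArgumentOutOfRangeException(nameof(value));
        byte[] data = BitConverter.GetBytes(value);
        if (BitConverter.IsLittleEndian) Array.Reverse(data);
        // 8 bytes give 11 characters plus "="; characters 5..10 hold the value.
        string base64 = Convert.ToBase64String(data, 0, 8).Substring(5, 6);
        return base64.Replace('+', '-').Replace('/', '_');
    }

    public static long Decode(string encoded)
    {
        if (encoded == null || encoded.Length != 6) throw new ArgumentException("expected 6 characters", nameof(encoded));
        // Map back to the standard alphabet, then restore the trimmed parts.
        string base64 = encoded.Replace('-', '+').Replace('_', '/');
        byte[] data = Convert.FromBase64String("AAAAA" + base64 + "=");
        if (BitConverter.IsLittleEndian) Array.Reverse(data);
        return BitConverter.ToInt64(data, 0);
    }
}
```

For example, `UrlSafeSix.Encode(1)` returns "AAAAAE", and `UrlSafeSix.Decode("AAAAAE")` returns 1.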


Making it "prettier"

Since base-64 has been determined to need 6 characters, any encoding variant that still fits the input bits into 6 characters will produce an output just as small. A base-32 encoding won't quite make the cut, as 6 base-32 characters can only store 30 bits of information (lg(32) * 6).

However, the same output size can be achieved with a custom base-48 (or 52/62) encoding. (The advantage of a base between 48 and 62 is that it only requires a subset of the alphanumeric characters and needs no symbols; optionally, "ambiguous" characters like "1" and "I" can be left out of the alphabet.) With a base-48 system, 6 characters can encode ~33.5 bits (lg(48) * 6) of information, which is just above the ~33.2 (or ~33.06) bits (lg(10) * 10) required.

Here is a proof-of-concept:

// This does not "pad" values
string Encode(long inp, IEnumerable<char> map) {
    Debug.Assert(inp >= 0, "not implemented for negative numbers");

    var b = map.Count();
    // value -> character
    var toChar = map.Select((v, i) => new {Value = v, Index = i}).ToDictionary(i => i.Index, i => i.Value);
    var res = "";
    if (inp == 0) {
      return "" + toChar[0];
    }
    while (inp > 0) {
      // encoded least-to-most significant
      var val = (int)(inp % b);
      inp = inp / b;
      res += toChar[val];
    }
    return res;
}

long Decode(string encoded, IEnumerable<char> map) {
    var b = map.Count();
    // character -> value
    var toVal = map.Select((v, i) => new {Value = v, Index = i}).ToDictionary(i => i.Value, i => i.Index);      
    long res = 0;
    // go in reverse to mirror encoding
    for (var i = encoded.Length - 1; i >= 0; i--) {
      var ch = encoded[i];
      var val = toVal[ch];
      res = (res * b) + val;
    }
    return res;
}

void Main()
{
    // base-48 alphabet; omits l/L, 1, i/I, o/O, 0
    var map = new char [] {
        'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K',
        'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W',
        'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g',
        'h', 'j', 'k', 'm', 'n', 'p', 'q', 'r', 's', 't',
        'u', 'v', 'x', 'y', 'z', '2', '3', '4',
    };
    var test = new long[] {0, 1, 9999999999, 4294965286, 2292964213, 1000000000};
    foreach (var t in test) {
        var encoded = Encode(t, map);
        var decoded = Decode(encoded, map);
        Console.WriteLine(string.Format("value: {0} encoded: {1}", t, encoded));
        if (t != decoded) {
            throw new Exception("failed for " + t);
        }
    }
}

The result is:

value: 0 encoded: A
value: 1 encoded: B
value: 9999999999 encoded: SrYsNt
value: 4294965286 encoded: ZNGEvT
value: 2292964213 encoded: rHd24J
value: 1000000000 encoded: TrNVzD
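
Small values such as 0 and 1 come out as a single character because the proof-of-concept does not pad. If a fixed width of 6 characters is wanted, one option is to always emit 6 digits. The following is a sketch under that assumption; `Base48Fixed` is a hypothetical name, and the alphabet is the same 48 characters as in the proof-of-concept. Since the digits are written least-significant first, the padding ends up as trailing "A" characters, which decode as high-order zeros.

```csharp
using System;
using System.Text;

static class Base48Fixed
{
    // Same 48-character alphabet as the proof-of-concept.
    const string Alphabet = "ABCDEFGHJKMNPQRSTUVWXYZ" + "abcdefghjkmnpqrstuvxyz" + "234";
    const int Width = 6;

    public static string Encode(long value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        var sb = new StringBuilder(Width);
        // Always emit Width digits, least significant first.
        for (int i = 0; i < Width; i++)
        {
            sb.Append(Alphabet[(int)(value % Alphabet.Length)]);
            value /= Alphabet.Length;
        }
        // Anything left over means the value needs more than 6 characters.
        if (value != 0) throw new ArgumentOutOfRangeException(nameof(value), "does not fit in 6 base-48 characters");
        return sb.ToString();
    }

    public static long Decode(string encoded)
    {
        if (encoded == null || encoded.Length != Width) throw new ArgumentException("expected 6 characters", nameof(encoded));
        long result = 0;
        // Read most significant digit (the last character) first.
        for (int i = Width - 1; i >= 0; i--)
        {
            int digit = Alphabet.IndexOf(encoded[i]);
            if (digit < 0) throw new FormatException("character not in alphabet: " + encoded[i]);
            result = result * Alphabet.Length + digit;
        }
        return result;
    }
}
```

Here `Base48Fixed.Encode(1)` returns "BAAAAA", while 1000000000 still encodes to "TrNVzD" as in the results listed above.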


The above considers the case where the numbers are "random and opaque"; that is, there is nothing that can be determined about the internals of the number. However, if there is a defined structure (e.g. 7th, 8th, and 9th bits are always zero and 2nd and 15th bits are always the same) then -- if and only if 4 or more bits of information can be eliminated from the input -- only 5 base-64 characters would be required. The added complexities and reliance upon the structure very likely outweigh any marginal gain.
