Compress large integers into the smallest possible string

Question

I have a bunch of 10 digit integers that I'm passing in a URL. Something like: "4294965286", "2292964213". They will always be positive and always be 10 digits.

I'd like to compress those integers into the smallest possible form that can still be used in a URL (i.e. letters and numbers are perfectly fine) and then uncompress them later. I've looked at using GZipStream, but it creates larger strings, not shorter ones.

I'm currently using ASP.NET, so a VB.NET or C# solution would be best.

Thanks

Recommended answer

Yes. GZIP is a compression algorithm which both requires compressible data and has an overhead (framing and dictionaries, etc). An encoding algorithm should be used instead.

The "simple" method is to use base-64 encoding.

That is, convert the number (which is represented as base 10 in the string) to the actual series of bytes that represent the number (5 bytes will cover a 10-digit decimal number) and then base-64 encode that result. Each base-64 character stores 6 bits of information (compared with ~3.3 bits/character for decimal digits), so the result is just over half the original size (in this case, 6* base-64 output characters are required).
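
As a rough illustration of the "bytes, then base-64" idea, here is a minimal sketch using the stock .NET encoder. `ToFiveBytes` is a hypothetical helper written for this example (it is not a framework method), and the snippet assumes C# 9 or later so that top-level statements compile. It should show that a standard encoder turns 5 bytes into 8 characters, the last one being `=` padding, which is why reaching 6 characters needs the extra trimming described further down.

```csharp
using System;

// Hypothetical helper (an assumption for this sketch): the big-endian
// 5-byte form of a non-negative value below 2^40.
static byte[] ToFiveBytes(long value)
{
    var bytes = new byte[5];
    for (int i = 4; i >= 0; i--)
    {
        bytes[i] = (byte)(value & 0xFF); // lowest-order byte goes last
        value >>= 8;
    }
    return bytes;
}

// Standard base-64 works on groups of 3 bytes, so 5 bytes are padded out.
string encoded = Convert.ToBase64String(ToFiveBytes(4294965286L));
Console.WriteLine($"{encoded} ({encoded.Length} characters)");
```

Note that the standard alphabet can emit `+` and `/`, so this output is not yet URL-friendly.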

Additionally, since the input/output lengths are obtainable from the data itself, "123" might be originally (before being base-64 encoded) converted as 1 byte, "30000" as 2 bytes, etc. This would be advantageous if not all the numbers are approximately the same length.
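
A minimal sketch of that variable-length idea, assuming non-negative inputs. `MinimalByteCount` and `ToMinimalBytes` are hypothetical names introduced here, not part of the original answer or of .NET; the snippet also assumes C# 9 or later for top-level statements.

```csharp
using System;

// Hypothetical helper: how many bytes a non-negative value needs.
static int MinimalByteCount(long value)
{
    int count = 1;                       // even 0 occupies one byte
    while ((value >>= 8) > 0) count++;   // one more byte per remaining 8 bits
    return count;
}

// Hypothetical helper: big-endian bytes with the leading zero bytes dropped.
static byte[] ToMinimalBytes(long value)
{
    var bytes = new byte[MinimalByteCount(value)];
    for (int i = bytes.Length - 1; i >= 0; i--)
    {
        bytes[i] = (byte)(value & 0xFF);
        value >>= 8;
    }
    return bytes;
}

foreach (long n in new long[] { 123, 30000, 4294965286 })
{
    byte[] bytes = ToMinimalBytes(n);
    Console.WriteLine($"{n}: {bytes.Length} byte(s) -> {Convert.ToBase64String(bytes)}");
}
```

For the fixed 10-digit inputs in the question this buys nothing, since every value needs 4 or 5 bytes anyway.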

Happy coding.

* 6 characters of base-64 output are required.

Edit: I was wrong initially where I said "2.3 bits/char" for decimal and proposed that less than half the characters were required. I have updated the answer above and show the (should be correct) math here, where lg(n) is log to the base 2.

The number of input bits required to represent the input number is bits/char * chars -> lg(10) * 10 (or just lg(9999999999)) -> ~33.2 bits. Using jball's manipulation to shift the number first, the number of bits required is lg(8999999999) -> ~33.06 bits. However this transformation isn't able to increase the efficiency in this particular case (the number of input bits would need to be reduced to 30 or below to make a difference here).

So we try to find an x (number of characters in base-64 encoding) such that:

lg(64) * x = 33.2 -> 6 * x = 33.2 -> x ~ 5.53. Of course five and a half characters is nonsensical, so we choose 6 as the maximum number of characters required to encode a value up to 9999999999 in base-64 encoding. This is slightly more than half of the original 10 characters.
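
For anyone who wants to re-check that arithmetic, a small sketch along these lines should do. It only restates the calculation above with `Math.Log`; the variable names are made up for the example, and it assumes C# 9 or later for top-level statements.

```csharp
using System;

// Bits needed for the largest 10-digit value: lg(9999999999).
double bitsNeeded = Math.Log(9999999999d, 2);

// Bits needed after shifting the range down first: lg(8999999999).
double bitsShifted = Math.Log(8999999999d, 2);

// Each base-64 character carries lg(64) = 6 bits.
double charsExact = bitsNeeded / 6;
int chars = (int)Math.Ceiling(charsExact);
int charsShifted = (int)Math.Ceiling(bitsShifted / 6);

Console.WriteLine($"{bitsNeeded:F2} bits -> {charsExact:F2} -> {chars} characters");
Console.WriteLine($"{bitsShifted:F2} bits after shifting -> {charsShifted} characters");
```

Both variants round up to the same character count, which is the point made above about the shift not helping here.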

However, it should be noted that to obtain only 6 characters in base-64 output requires a non-standard base-64 encoder or a little bit of manipulation (most base-64 encoders only work on whole bytes). This works because out of the original 5 "required bytes" only 34 of the 40 bits are used (the top 6 bits are always 0). It would require 7 base-64 characters to encode all 40 bits.

Here is a modification of the code that Guffa posted in his answer (if you like it, go give him an up-vote) that only requires 6 base-64 characters. Please see other notes in Guffa's answer and Base64 for URL applications as the method below does not use a URL-friendly mapping.

long value = 4294965286; // example input; must be a long (8 bytes) below 2^34
byte[] data = BitConverter.GetBytes(value);
// make data big-endian if needed
if (BitConverter.IsLittleEndian) {
   Array.Reverse(data);
}
// the first 5 base-64 characters are always "A" (as the first 30 bits are always zero),
// so only the next 6 characters (36 bits) are kept; the trailing "=" padding is dropped too
string base64 = Convert.ToBase64String(data, 0, 8).Substring(5,6);

byte[] data2 = new byte[8];
// add back in all the characters removed during encoding
Convert.FromBase64String("AAAAA" + base64 + "=").CopyTo(data2, 0);
// reverse again from big to little-endian
if (BitConverter.IsLittleEndian) {
   Array.Reverse(data2);
}
long decoded = BitConverter.ToInt64(data2, 0);
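
Since the snippet above can still emit `+` and `/`, one possible follow-up is to swap them for `-` and `_`, the substitution used by the "base64url" alphabet in RFC 4648. The sketch below wraps the same steps in two functions; `EncodeUrlSafe` and `DecodeUrlSafe` are hypothetical names, the approach assumes values below 2^34 (as above), and top-level statements need C# 9 or later.

```csharp
using System;

// Hypothetical wrapper around the encoding steps shown above.
static string EncodeUrlSafe(long value)
{
    byte[] data = BitConverter.GetBytes(value);
    if (BitConverter.IsLittleEndian) Array.Reverse(data);   // to big-endian
    string base64 = Convert.ToBase64String(data, 0, 8).Substring(5, 6);
    return base64.Replace('+', '-').Replace('/', '_');      // URL-friendly symbols
}

// Hypothetical wrapper around the decoding steps shown above.
static long DecodeUrlSafe(string text)
{
    string base64 = text.Replace('-', '+').Replace('_', '/');
    byte[] data = Convert.FromBase64String("AAAAA" + base64 + "=");
    if (BitConverter.IsLittleEndian) Array.Reverse(data);   // back to native order
    return BitConverter.ToInt64(data, 0);
}

foreach (long n in new long[] { 4294965286, 2292964213, 9999999999 })
{
    string s = EncodeUrlSafe(n);
    Console.WriteLine($"{n} -> {s} -> {DecodeUrlSafe(s)}");
}
```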


Making it "prettier"

Since base-64 has been determined to use 6 characters, any encoding variant that still encodes the input bits into 6 characters will create just as small an output. Using a base-32 encoding won't quite make the cut, as in base-32 encoding 6 characters can only store 30 bits of information (lg(32) * 6).

However, the same output size could be achieved with a custom base-48 (or 52/62) encoding. (The advantage of a base between 48 and 62 is that it only requires a subset of alphanumeric characters and does not need symbols; optionally, "ambiguous" symbols like "1" and "I" can be avoided in some variants.) With a base-48 system the 6 characters can encode ~33.5 bits (lg(48) * 6) of information, which is just above the ~33.2 (or ~33.06) bits (lg(10) * 10) required.
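
The same comparison can be made with exact integer arithmetic instead of logarithms. In this sketch `Fits` is a hypothetical helper (not from the original answer) that checks whether a given number of characters in a given base can cover every value up to a maximum; it assumes C# 9 or later for top-level statements.

```csharp
using System;

// Hypothetical helper: can `chars` characters of a base-`radix` alphabet
// represent every value in 0..max? True when radix^chars exceeds max.
static bool Fits(int radix, int chars, long max)
{
    long capacity = 1;
    for (int i = 0; i < chars; i++) capacity *= radix;   // radix^chars
    return capacity > max;
}

foreach (int radix in new[] { 32, 48, 52, 62, 64 })
{
    Console.WriteLine($"base-{radix}: 6 characters cover 10 digits? {Fits(radix, 6, 9999999999L)}");
}
```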

Here is a proof-of-concept:

// Namespaces used: System, System.Collections.Generic, System.Diagnostics, System.Linq
// This does not "pad" values
string Encode(long inp, IEnumerable<char> map) {
    Debug.Assert(inp >= 0, "not implemented for negative numbers");

    var b = map.Count();
    // value -> character
    var toChar = map.Select((v, i) => new {Value = v, Index = i}).ToDictionary(i => i.Index, i => i.Value);
    var res = "";
    if (inp == 0) {
      return "" + toChar[0];
    }
    while (inp > 0) {
      // encoded least-to-most significant
      var val = (int)(inp % b);
      inp = inp / b;
      res += toChar[val];
    }
    return res;
}

long Decode(string encoded, IEnumerable<char> map) {
    var b = map.Count();
    // character -> value
    var toVal = map.Select((v, i) => new {Value = v, Index = i}).ToDictionary(i => i.Value, i => i.Index);      
    long res = 0;
    // go in reverse to mirror encoding
    for (var i = encoded.Length - 1; i >= 0; i--) {
      var ch = encoded[i];
      var val = toVal[ch];
      res = (res * b) + val;
    }
    return res;
}

void Main()
{
    // a base-48 alphabet: omits l/L, i/I, o/O, lowercase w, and every digit except 2-4
    var map = new char [] {
        'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K',
        'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W',
        'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g',
        'h', 'j', 'k', 'm', 'n', 'p', 'q', 'r', 's', 't',
        'u', 'v', 'x', 'y', 'z', '2', '3', '4',
    };
    var test = new long[] {0, 1, 9999999999, 4294965286, 2292964213, 1000000000};
    foreach (var t in test) {
        var encoded = Encode(t, map);
        var decoded = Decode(encoded, map);
        Console.WriteLine(string.Format("value: {0} encoded: {1}", t, encoded));
        if (t != decoded) {
            throw new Exception("failed for " + t);
        }
    }
}

The result is:

value: 0 encoded: A
value: 1 encoded: B
value: 9999999999 encoded: SrYsNt
value: 4294965286 encoded: ZNGEvT
value: 2292964213 encoded: rHd24J
value: 1000000000 encoded: TrNVzD
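
As the output shows, small values such as 0 and 1 come out shorter than 6 characters because the proof-of-concept does not pad. If fixed-width tokens are wanted (say, to concatenate several values in one URL), a padded variant might look like the sketch below. `EncodeFixed` and `DecodeFixed` are hypothetical names, the alphabet string is assumed to hold the same 48 symbols in the same order as the map above, and top-level statements need C# 9 or later.

```csharp
using System;
using System.Text;

// Assumed to match the 48-symbol map above, in the same order.
const string Alphabet = "ABCDEFGHJKMNPQRSTUVWXYZabcdefghjkmnpqrstuvxyz234";

// Hypothetical helper: encode least-significant digit first, then pad.
static string EncodeFixed(long value, string alphabet, int width)
{
    var sb = new StringBuilder();
    do
    {
        sb.Append(alphabet[(int)(value % alphabet.Length)]);
        value /= alphabet.Length;
    } while (value > 0);
    // The most significant digits sit at the end, so padding with the
    // zero symbol on the right leaves the decoded value unchanged.
    return sb.ToString().PadRight(width, alphabet[0]);
}

// Hypothetical helper: walk from the most significant (last) character down.
static long DecodeFixed(string text, string alphabet)
{
    long result = 0;
    for (int i = text.Length - 1; i >= 0; i--)
    {
        result = result * alphabet.Length + alphabet.IndexOf(text[i]);
    }
    return result;
}

foreach (long n in new long[] { 0, 1, 1000000000, 9999999999 })
{
    string s = EncodeFixed(n, Alphabet, 6);
    Console.WriteLine($"value: {n} encoded: {s} decoded: {DecodeFixed(s, Alphabet)}");
}
```

For the 10-digit inputs in the question the padding never kicks in, since every such value already needs all 6 characters.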


The above considers the case where the numbers are "random and opaque"; that is, there is nothing that can be determined about the internals of the number. However, if there is a defined structure (e.g. 7th, 8th, and 9th bits are always zero and 2nd and 15th bits are always the same) then -- if and only if 4 or more bits of information can be eliminated from the input -- only 5 base-64 characters would be required. The added complexities and reliance upon the structure very likely outweigh any marginal gain.
