Base-N encoding of a byte array


Problem description

A couple of days ago I came across this CodeReview for Base-36 encoding a byte array. However, the answers that followed didn't touch on decoding back into a byte array, or possibly reusing the answer to perform encodings of different bases (radix).

The answer for the linked question uses BigInteger. So as far as implementation goes, the base and its digits could be parametrized.

The problem with BigInteger though, is that we're treating our input as an assumed integer. However, our input, a byte array, is just an opaque series of values.

  • If the byte array ends in a series of zero bytes, e.g. {0xFF,0x7F,0x00,0x00}, those bytes will be lost when using the algorithm in the answer (it would only encode {0xFF,0x7F}).
  • If the last non-zero byte has the sign bit set, then the zero byte that follows it is consumed, as it's treated as the BigInteger's sign delimiter. So {0xFF,0xFF,0x00,0x00} would encode only as {0xFF,0xFF,0x00}.
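
Both cases can be reproduced with nothing more than a round trip through BigInteger. The following is a minimal sketch, not part of the original answer; the class and method names are made up for illustration.

```csharp
using System;
using System.Numerics;

// Hypothetical demo class (not from the original answer).
static class BigIntegerRoundTripDemo
{
    // Interprets the bytes as a little-endian two's-complement integer and
    // asks BigInteger for its minimal byte representation again.
    public static byte[] RoundTrip(byte[] input)
    {
        if (input == null) throw new ArgumentNullException("input");
        return new BigInteger(input).ToByteArray();
    }
}
```

RoundTrip on {0xFF,0x7F,0x00,0x00} returns the two bytes {0xFF,0x7F}, and on {0xFF,0xFF,0x00,0x00} it returns the three bytes {0xFF,0xFF,0x00}: the value is preserved, the original buffer length is not.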

How could a .NET programmer use BigInteger to create a reasonably efficient and radix-agnostic encoder, with decoding support, plus the ability to handle endian-ness, and with the ability to 'work around' the ending zero bytes being lost?

Solution

edit [2020/01/26]: FWIW, the code below, along with its unit tests, lives alongside my open source libraries on Github.

edit [2016/04/19]: If you're fond of exceptions, you may wish to change some of the Decode implementation code to throw InvalidDataException instead of just returning null.
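
If you would rather not touch the Decode implementation itself, one option is a thin wrapper that turns the null result into an exception. This is only a sketch: DecodeGuards and DecodeOrThrow are hypothetical names, and the decoder is passed in as a delegate so the snippet stands on its own.

```csharp
using System;
using System.IO;

// Hypothetical wrapper (not from the original answer).
static class DecodeGuards
{
    // Calls the supplied decoder and throws instead of handing back null.
    public static byte[] DecodeOrThrow(Func<string, byte[]> decode, string text)
    {
        if (decode == null) throw new ArgumentNullException("decode");
        if (text == null) throw new ArgumentNullException("text");

        byte[] result = decode(text);
        if (result == null)
            throw new InvalidDataException("Input contains a character that is not a digit of this radix.");
        return result;
    }
}
```

With the class from the Code section this would be called as DecodeGuards.DecodeOrThrow(encoding.Decode, text).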

edit [2014/09/14]: I've added a 'HACK' to Encode() to handle cases where the last byte in the input is signed (if you were to convert to sbyte). Only sane solution I could think of right now is to just Resize() the array by one. Additional unit tests for this case passed, but I didn't rerun perf code to account for such cases. If you can help it, always have your input to Encode() include a dummy 0 byte at the end to avoid additional allocations.
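
To show what the HACK is guarding against, here is a small sketch of the same idea in isolation. UnsignedBigInteger and FromLittleEndian are hypothetical names; the sketch copies into a new array where Encode() uses Array.Resize.

```csharp
using System;
using System.Numerics;

// Hypothetical helper (not from the original answer).
static class UnsignedBigInteger
{
    // Treats little-endian bytes as a non-negative value by appending a zero
    // byte whenever the last byte would otherwise be read as a sign byte.
    public static BigInteger FromLittleEndian(byte[] bytes)
    {
        if (bytes == null) throw new ArgumentNullException("bytes");
        if (bytes.Length == 0) return BigInteger.Zero;

        if ((bytes[bytes.Length - 1] & 0x80) != 0)
        {
            var padded = new byte[bytes.Length + 1]; // extra byte is already zero
            Array.Copy(bytes, padded, bytes.Length);
            return new BigInteger(padded);
        }
        return new BigInteger(bytes);
    }
}
```

For the single byte {0x80}, new BigInteger(bytes) is -128 while FromLittleEndian gives 128. On newer runtimes (.NET Core 2.1 and later, as far as I know) BigInteger also has a ReadOnlySpan<byte> constructor with an isUnsigned flag, which may make the manual padding unnecessary.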

Usage

I've created a RadixEncoding class (found in the "Code" section) which initializes with three parameters:

  1. The radix digits as a string (length determines the actual radix of course),
  2. The assumed byte ordering (endian) of input byte arrays,
  3. And whether or not the user wants the encode/decode logic to acknowledge ending zero bytes.

To create a Base-36 encoding with little-endian input that does not preserve ending zero bytes (pass true as the third argument to preserve them):

const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";
var base36_no_zeros = new RadixEncoding(k_base36_digits, EndianFormat.Little, false);

And then to actually perform encoding/decoding:

const string k_input = "A test 1234";
byte[] input_bytes = System.Text.Encoding.UTF8.GetBytes(k_input);
string encoded_string = base36_no_zeros.Encode(input_bytes);
byte[] decoded_bytes = base36_no_zeros.Decode(encoded_string);
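
For readers who want the core idea before reading the full class, the heart of Encode and Decode is a repeated divide (or multiply) by the radix. Below is a stripped-down sketch of that loop only. It ignores endianness and zero-byte padding, works on a BigInteger rather than a byte array, throws on invalid characters instead of returning null, and RadixSketch and its method names are made up for illustration.

```csharp
using System;
using System.Numerics;
using System.Text;

// Hypothetical simplified sketch (not the RadixEncoding class itself).
static class RadixSketch
{
    // Encodes a non-negative value, most significant digit first.
    public static string ToRadixString(BigInteger value, string digits)
    {
        if (digits == null) throw new ArgumentNullException("digits");
        if (value.Sign < 0) throw new ArgumentOutOfRangeException("value");
        if (value.IsZero) return digits[0].ToString();

        var radix = new BigInteger(digits.Length);
        var sb = new StringBuilder();
        while (!value.IsZero)
        {
            BigInteger remainder;
            value = BigInteger.DivRem(value, radix, out remainder);
            sb.Insert(0, digits[(int)remainder]); // each remainder is the next lower digit
        }
        return sb.ToString();
    }

    // Reverses ToRadixString; throws on characters that are not digits.
    public static BigInteger FromRadixString(string text, string digits)
    {
        if (text == null) throw new ArgumentNullException("text");
        if (digits == null) throw new ArgumentNullException("digits");

        var value = BigInteger.Zero;
        foreach (char c in text)
        {
            int i = digits.IndexOf(c);
            if (i < 0) throw new FormatException("Invalid digit: " + c);
            value = value * digits.Length + i;
        }
        return value;
    }
}
```

With the hex digits "0123456789abcdef", ToRadixString(255, ...) gives "ff", and FromRadixString("ff", ...) gives 255 back.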

Performance

Timed with Diagnostics.Stopwatch, run on an i7 860 @2.80GHz. The timing EXE was run by itself, not under a debugger.

Encoding was initialized with the same k_base36_digits string from above, EndianFormat.Little, and with ending zero bytes acknowledged (even though the UTF8 bytes don't have any extra ending zero bytes).

To encode the UTF8 bytes of "A test 1234" 1,000,000 times takes 2.6567905 seconds
To decode the same string the same number of times takes 3.3916248 seconds

To encode the UTF8 bytes of "A test 1234. Made slightly larger!" 100,000 times takes 1.1577325 seconds
To decode the same string the same number of times takes 1.244326 seconds
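
The timing harness itself isn't shown here. A loop along the following lines would produce comparable numbers, though this is a guess at the shape of it rather than the code that was actually run; TimingSketch and Measure are hypothetical names, and the warm-up call is my own addition.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing helper (an assumption, not the harness behind the numbers above).
static class TimingSketch
{
    // Runs the action once as a warm-up, then times the given number of iterations.
    public static TimeSpan Measure(Action action, int iterations)
    {
        if (action == null) throw new ArgumentNullException("action");
        if (iterations < 0) throw new ArgumentOutOfRangeException("iterations");

        action(); // warm-up so JIT compilation is not counted

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

It would be called as TimingSketch.Measure(() => encoding.Encode(input_bytes), 1000000), preferably from a release build run outside the debugger as described above.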

Code

If you don't have a CodeContracts generator, you will have to reimplement the contracts with if/throw code.
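
As a rough sketch of what that replacement could look like, each Contract.Requires<ArgumentNullException>(x != null) line can become a call to a small guard like the one below. Guard and NotNull are hypothetical names, not part of the code that follows.

```csharp
using System;

// Hypothetical guard helper (not from the original answer).
static class Guard
{
    // Plain if/throw stand-in for Contract.Requires<ArgumentNullException>(value != null).
    public static T NotNull<T>(T value, string paramName) where T : class
    {
        if (value == null) throw new ArgumentNullException(paramName);
        return value;
    }
}
```

For example, the constructor's first line would become Guard.NotNull(digits, "digits"). The Contract.Ensures postcondition in Encode has no direct if/throw equivalent at method entry; it can simply be dropped or checked just before the return.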

using System;
using System.Collections.Generic;
using System.Numerics;
using Contract = System.Diagnostics.Contracts.Contract;

public enum EndianFormat
{
    /// <summary>Least Significant Bit order (lsb)</summary>
    /// <remarks>Right-to-Left</remarks>
    /// <see cref="BitConverter.IsLittleEndian"/>
    Little,
    /// <summary>Most Significant Bit order (msb)</summary>
    /// <remarks>Left-to-Right</remarks>
    Big,
};

/// <summary>Encodes/decodes bytes to/from a string</summary>
/// <remarks>
/// Encoded string is always in big-endian ordering
/// 
/// <p>The constructor takes an <b>includeProceedingZeros</b> parameter which acts as a work-around
/// for an edge case with BigInteger.
/// MSDN says BigInteger byte arrays are in LSB->MSB ordering. So a byte buffer with zeros at the 
/// end will have those zeros ignored in the resulting encoded radix string.
/// If such a loss of data absolutely cannot occur, pass true for <b>includeProceedingZeros</b>
/// and for a tiny bit of extra processing Encode will pad with zero digits and Decode will pad
/// with zero bytes.</p>
/// <p>Note: with this enabled, Decode <b>may</b> return one more byte than what was originally 
/// given to Encode.</p>
/// </remarks>
// Based on the answers from http://codereview.stackexchange.com/questions/14084/base-36-encoding-of-a-byte-array/
public class RadixEncoding
{
    const int kByteBitCount = 8;

    readonly string kDigits;
    readonly double kBitsPerDigit;
    readonly BigInteger kRadixBig;
    readonly EndianFormat kEndian;
    readonly bool kIncludeProceedingZeros;

    /// <summary>Numerical base of this encoding</summary>
    public int Radix { get { return kDigits.Length; } }
    /// <summary>Endian ordering of bytes input to Encode and output by Decode</summary>
    public EndianFormat Endian { get { return kEndian; } }
    /// <summary>True if we want ending zero bytes to be encoded</summary>
    public bool IncludeProceedingZeros { get { return kIncludeProceedingZeros; } }

    public override string ToString()
    {
        return string.Format("Base-{0} {1}", Radix.ToString(), kDigits);
    }

    /// <summary>Create a radix encoder using the given characters as the digits in the radix</summary>
    /// <param name="digits">Digits to use for the radix-encoded string</param>
    /// <param name="bytesEndian">Endian ordering of bytes input to Encode and output by Decode</param>
    /// <param name="includeProceedingZeros">True if we want ending zero bytes to be encoded</param>
    public RadixEncoding(string digits,
        EndianFormat bytesEndian = EndianFormat.Little, bool includeProceedingZeros = false)
    {
        Contract.Requires<ArgumentNullException>(digits != null);
        int radix = digits.Length;

        kDigits = digits;
        kBitsPerDigit = System.Math.Log(radix, 2);
        kRadixBig = new BigInteger(radix);
        kEndian = bytesEndian;
        kIncludeProceedingZeros = includeProceedingZeros;
    }

    // Number of characters needed for encoding the specified number of bytes
    int EncodingCharsCount(int bytesLength)
    {
        return (int)Math.Ceiling((bytesLength * kByteBitCount) / kBitsPerDigit);
    }
    // Number of bytes needed for decoding the specified number of characters
    int DecodingBytesCount(int charsCount)
    {
        return (int)Math.Ceiling((charsCount * kBitsPerDigit) / kByteBitCount);
    }

    /// <summary>Encode a byte array into a radix-encoded string</summary>
    /// <param name="bytes">byte array to encode</param>
    /// <returns>The bytes encoded into a radix-encoded string</returns>
    /// <remarks>If <paramref name="bytes"/> is zero length, returns an empty string</remarks>
    public string Encode(byte[] bytes)
    {
        Contract.Requires<ArgumentNullException>(bytes != null);
        Contract.Ensures(Contract.Result<string>() != null);

        // Don't really have to do this, our code will build this result (empty string),
        // but why not catch the condition before doing work?
        if (bytes.Length == 0) return string.Empty;

        // if the array ends with zeros, having the capacity set to this will help us know how much
        // 'padding' we will need to add
        int result_length = EncodingCharsCount(bytes.Length);
        // List<> has a(n in-place) Reverse method. StringBuilder doesn't. That's why.
        var result = new List<char>(result_length);

        // HACK: BigInteger uses the last byte as the 'sign' byte. If that byte's MSB is set, 
        // we need to pad the input with an extra 0 (ie, make it positive)
        if ( (bytes[bytes.Length-1] & 0x80) == 0x80 )
            Array.Resize(ref bytes, bytes.Length+1);

        var dividend = new BigInteger(bytes);
        // IsZero's computation is less complex than evaluating "dividend > 0"
        // which invokes BigInteger.CompareTo(BigInteger)
        while (!dividend.IsZero)
        {
            BigInteger remainder;
            dividend = BigInteger.DivRem(dividend, kRadixBig, out remainder);
            int digit_index = System.Math.Abs((int)remainder);
            result.Add(kDigits[digit_index]);
        }

        if (kIncludeProceedingZeros)
            for (int x = result.Count; x < result.Capacity; x++)
                result.Add(kDigits[0]); // pad with the character that represents 'zero'

        // orientate the characters in big-endian ordering
        if (kEndian == EndianFormat.Little)
            result.Reverse();
        // If we didn't end up adding padding, ToArray will end up returning a TrimExcess'd array, 
        // so nothing wasted
        return new string(result.ToArray());
    }

    void DecodeImplPadResult(ref byte[] result, int padCount)
    {
        if (padCount > 0)
        {
            int new_length = result.Length + DecodingBytesCount(padCount);
            Array.Resize(ref result, new_length); // new bytes will be zero, just the way we want it
        }
    }
    #region Decode (Little Endian)
    byte[] DecodeImpl(string chars, int startIndex = 0)
    {
        var bi = new BigInteger();
        for (int x = startIndex; x < chars.Length; x++)
        {
            int i = kDigits.IndexOf(chars[x]);
            if (i < 0) return null; // invalid character
            bi *= kRadixBig;
            bi += i;
        }

        return bi.ToByteArray();
    }
    byte[] DecodeImplWithPadding(string chars)
    {
        int pad_count = 0;
        for (int x = 0; x < chars.Length; x++, pad_count++)
            if (chars[x] != kDigits[0]) break;

        var result = DecodeImpl(chars, pad_count);
        DecodeImplPadResult(ref result, pad_count);

        return result;
    }
    #endregion
    #region Decode (Big Endian)
    byte[] DecodeImplReversed(string chars, int startIndex = 0)
    {
        var bi = new BigInteger();
        for (int x = (chars.Length-1)-startIndex; x >= 0; x--)
        {
            int i = kDigits.IndexOf(chars[x]);
            if (i < 0) return null; // invalid character
            bi *= kRadixBig;
            bi += i;
        }

        return bi.ToByteArray();
    }
    byte[] DecodeImplReversedWithPadding(string chars)
    {
        int pad_count = 0;
        for (int x = chars.Length - 1; x >= 0; x--, pad_count++)
            if (chars[x] != kDigits[0]) break;

        var result = DecodeImplReversed(chars, pad_count);
        DecodeImplPadResult(ref result, pad_count);

        return result;
    }
    #endregion
    /// <summary>Decode a radix-encoded string into a byte array</summary>
    /// <param name="radixChars">radix string</param>
    /// <returns>The decoded bytes, or null if an invalid character is encountered</returns>
    /// <remarks>
    /// If <paramref name="radixChars"/> is an empty string, returns a zero length array
    /// 
    /// With <see cref="IncludeProceedingZeros"/> enabled, this can return a buffer with an
    /// additional zero byte that wasn't in the input. So if a 4 byte buffer was encoded, this could
    /// end up returning a 5 byte buffer, with the extra byte being zero.
    /// </remarks>
    public byte[] Decode(string radixChars)
    {
        Contract.Requires<ArgumentNullException>(radixChars != null);

        if (kEndian == EndianFormat.Big)
            return kIncludeProceedingZeros ? DecodeImplReversedWithPadding(radixChars) : DecodeImplReversed(radixChars);
        else
            return kIncludeProceedingZeros ? DecodeImplWithPadding(radixChars) : DecodeImpl(radixChars);
    }
};
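
The two size helpers in the class are the least obvious part, so here they are pulled out on their own. Each digit of radix r carries log2(r) bits, so n bytes need ceil(8n / log2(r)) digits, and c digits need ceil(c * log2(r) / 8) bytes. This is a sketch with the radix passed as a parameter instead of read from the instance; DigitCountSketch is a hypothetical name.

```csharp
using System;

// Hypothetical standalone version of the class's two size helpers.
static class DigitCountSketch
{
    const int BitsPerByte = 8;

    // Number of radix digits needed to hold bytesLength bytes.
    public static int EncodingCharsCount(int bytesLength, int radix)
    {
        if (radix < 2) throw new ArgumentOutOfRangeException("radix"); // log2(1) is 0
        double bitsPerDigit = Math.Log(radix, 2);
        return (int)Math.Ceiling((bytesLength * BitsPerByte) / bitsPerDigit);
    }

    // Number of bytes needed to hold charsCount radix digits.
    public static int DecodingBytesCount(int charsCount, int radix)
    {
        if (radix < 2) throw new ArgumentOutOfRangeException("radix");
        double bitsPerDigit = Math.Log(radix, 2);
        return (int)Math.Ceiling((charsCount * bitsPerDigit) / BitsPerByte);
    }
}
```

In base 36 a digit carries about 5.17 bits, so 4 bytes need 7 digits, while 7 digits are given room for 5 bytes. That rounding up is one reason a decoded buffer can come back a byte longer than the input when padding is enabled. Note that the class itself does not reject a digit string shorter than two characters; the radix check here is an addition.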

Basic Unit Tests

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

static bool ArraysCompareN<T>(T[] input, T[] output)
    where T : IEquatable<T>
{
    if (output.Length < input.Length) return false;
    for (int x = 0; x < input.Length; x++)
        if(!output[x].Equals(input[x])) return false;

    return true;
}
static bool RadixEncodingTest(RadixEncoding encoding, byte[] bytes)
{
    string encoded = encoding.Encode(bytes);
    byte[] decoded = encoding.Decode(encoded);

    return ArraysCompareN(bytes, decoded);
}
[TestMethod]
public void TestRadixEncoding()
{
    const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";
    var base36 = new RadixEncoding(k_base36_digits, EndianFormat.Little, true);
    var base36_no_zeros = new RadixEncoding(k_base36_digits, EndianFormat.Little, false);

    byte[] ends_with_zero_neg = { 0xFF, 0xFF, 0x00, 0x00 };
    byte[] ends_with_zero_pos = { 0xFF, 0x7F, 0x00, 0x00 };
    byte[] text = System.Text.Encoding.ASCII.GetBytes("A test 1234");

    Assert.IsTrue(RadixEncodingTest(base36, ends_with_zero_neg));
    Assert.IsTrue(RadixEncodingTest(base36, ends_with_zero_pos));
    Assert.IsTrue(RadixEncodingTest(base36_no_zeros, text));
}
