Efficient way of storing Huffman tree


Problem description



      I am writing a Huffman encoding/decoding tool and am looking for an efficient way to store the Huffman tree that is created to store inside of the output file.

      Currently there are two different versions I am implementing.

      1. This one reads the entire file into memory character by character and builds a frequency table for the whole document. This would only require outputting the tree once, and thus efficiency is not that big of a concern, other than if the input file is small.
      2. The other method I am using is to read a chunk of data, about 64 kilobytes in size, run the frequency analysis over that, create a tree and encode it. However, in this case I will need to output my frequency tree before every chunk so that the decoder is able to rebuild its tree and properly decode the encoded file. This is where efficiency does come into play, since I want to save as much space as possible.

      In my searches so far I have not found a good way of storing the tree in as little space as possible. I am hoping the StackOverflow community can help me find a good solution!

      Solution

      Since you already have to implement code to handle a bit-wise layer on top of your byte-organized stream/file, here's my proposal.

      Do not store the actual frequencies, they're not needed for decoding. You do, however, need the actual tree.

      So for each node, starting at root:

      1. If leaf-node: Output 1-bit + N-bit character/byte
      2. If not leaf-node, output 0-bit. Then encode both child nodes (left first then right) the same way

      To read, do this:

      1. Read bit. If 1, then read N-bit character/byte, return new node around it with no children
      2. If bit was 0, decode left and right child-nodes the same way, and return new node around them with those children, but no value

      A leaf-node is basically any node that doesn't have children.

      With this approach, you can calculate the exact size of your output before writing it, to figure out if the gains are enough to justify the effort. This assumes you have a dictionary of key/value pairs that contains the frequency of each character, where frequency is the actual number of occurrences.

      Pseudo-code for calculation:

      Tree-size = (SIZE_OF_ONE_CHARACTER + 2) * NUMBER_OF_CHARACTERS - 1
      Encoded-size = Sum(for each char,freq in table: freq * len(PATH(char)))
      

      The tree-size calculation takes the leaf and non-leaf nodes into account: each leaf costs 1 + SIZE_OF_ONE_CHARACTER bits, each internal node costs 1 bit, and there is one less internal node than there are characters. For 8-bit characters this works out to 10 * NUMBER_OF_CHARACTERS - 1.

      SIZE_OF_ONE_CHARACTER is the number of bits per character, and the two sizes added together give the total number of bits that my approach will occupy for the tree plus the encoded data.

      PATH(c) is a function/table that would yield the bit-path from root down to that character in the tree.
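The size calculation above can be tried out with a short Python sketch. It is only an illustration under a few assumptions: characters are 8 bits wide, the code lengths come from a plain heap-based Huffman construction, and the helper names (`code_lengths`, `tree_size_bits`, `encoded_size_bits`) are made up for this example rather than taken from any library.

```python
import heapq
from collections import Counter
from itertools import count


def code_lengths(freqs):
    """Return {symbol: Huffman code length in bits} for a frequency table."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(weight, next(tiebreak), {sym: 0}) for sym, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Every symbol under the new parent sits one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


def tree_size_bits(symbol_count, bits_per_symbol=8):
    """Leaves cost 1 flag bit plus the symbol; internal nodes cost 1 bit."""
    leaves = symbol_count * (1 + bits_per_symbol)
    internal = symbol_count - 1
    return leaves + internal


def encoded_size_bits(freqs):
    """Sum of freq * len(PATH(char)) over the whole table."""
    lengths = code_lengths(freqs)
    return sum(freqs[sym] * lengths[sym] for sym in freqs)


freqs = Counter("AAAAAABCCCCCCDDEEEEE")
tree_bits = tree_size_bits(len(freqs))
data_bits = encoded_size_bits(freqs)
total_bytes = (tree_bits + data_bits + 7) // 8  # round up to whole bytes
print(tree_bits, data_bits, total_bytes)  # 49 43 12
```

Different tie-breaking can produce differently shaped trees, but the total encoded size is the same for any valid Huffman tree, so the estimate does not depend on which one is built.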

      Here's a C#-looking pseudo-code to do it, which assumes one character is just a simple byte.

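      void EncodeNode(Node node, BitWriter writer)
      {
          if (node.IsLeafNode)
          {
              // Leaf: a 1-bit flag followed by the character itself
              writer.WriteBit(1);
              writer.WriteByte(node.Value);
          }
          else
          {
              // Internal node: a 0-bit flag, then the left and right subtrees
              writer.WriteBit(0);
              EncodeNode(node.LeftChild, writer);
              EncodeNode(node.RightChild, writer);
          }
      }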
      

      To read it back in:

      Node ReadNode(BitReader reader)
      {
          if (reader.ReadBit() == 1)
          {
              return new Node(reader.ReadByte(), null, null);
          }
          else
          {
              Node leftChild = ReadNode(reader);
              Node rightChild = ReadNode(reader);
              return new Node(0, leftChild, rightChild);
          }
      }
      

      An example (simplified, use properties, etc.) Node implementation:

      public class Node
      {
          public Byte Value;
          public Node LeftChild;
          public Node RightChild;
      
          public Node(Byte value, Node leftChild, Node rightChild)
          {
              Value = value;
              LeftChild = leftChild;
              RightChild = rightChild;
          }
      
          public Boolean IsLeafNode
          {
              get
              {
                  return LeftChild == null;
              }
          }
      }
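Since `BitWriter` and `BitReader` are left undefined in the pseudo-code, here is a self-contained Python sketch of the same round trip. Everything in it is an assumption made for illustration: the two bit-stream classes are minimal stand-ins, bits are packed most significant bit first, a leaf is represented as a plain `int` and an internal node as a `(left, right)` tuple.

```python
class BitWriter:
    """Hypothetical minimal bit sink; collects bits, packs them MSB-first."""

    def __init__(self):
        self.bits = []

    def write_bit(self, bit):
        self.bits.append(bit & 1)

    def write_byte(self, value):
        for shift in range(7, -1, -1):
            self.write_bit(value >> shift)

    def to_bytes(self):
        # Pad the final byte with zero bits.
        padded = self.bits + [0] * (-len(self.bits) % 8)
        out = bytearray()
        for i in range(0, len(padded), 8):
            byte = 0
            for bit in padded[i:i + 8]:
                byte = (byte << 1) | bit
            out.append(byte)
        return bytes(out)


class BitReader:
    """Hypothetical minimal bit source over a bytes object, MSB-first."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def read_bit(self):
        byte = self.data[self.pos // 8]
        bit = (byte >> (7 - self.pos % 8)) & 1
        self.pos += 1
        return bit

    def read_byte(self):
        value = 0
        for _ in range(8):
            value = (value << 1) | self.read_bit()
        return value


def encode_node(node, writer):
    if isinstance(node, int):      # leaf: 1-bit flag + 8-bit symbol
        writer.write_bit(1)
        writer.write_byte(node)
    else:                          # internal: 0-bit flag, then left, then right
        writer.write_bit(0)
        encode_node(node[0], writer)
        encode_node(node[1], writer)


def read_node(reader):
    if reader.read_bit() == 1:
        return reader.read_byte()
    left = read_node(reader)
    right = read_node(reader)
    return (left, right)


# A five-leaf tree chosen for illustration.
tree = ((ord("A"), ord("C")), (ord("E"), (ord("B"), ord("D"))))
writer = BitWriter()
encode_node(tree, writer)
packed = writer.to_bytes()
print(len(writer.bits), len(packed))          # 49 7
print(read_node(BitReader(packed)) == tree)   # True
```

The reader needs no length prefix for the tree: the recursion stops on its own once every internal node has received both children, so the encoded data can follow immediately after the last tree bit.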
      


      Here's a sample output from a specific example.

      Input: AAAAAABCCCCCCDDEEEEE

      Frequencies:

      • A: 6
      • B: 1
      • C: 6
      • D: 2
      • E: 5

      Each character is just 8 bits, so the size of the tree will be 10 * 5 - 1 = 49 bits.

      The tree could look like this:

            20
        ----------
        |        8
        |     -------
       12     |     3
      -----   |   -----
      A   C   E   B   D
      6   6   5   1   2
      

      So the paths to each character are as follows (0 is left, 1 is right):

      • A: 00
      • B: 110
      • C: 01
      • D: 111
      • E: 10

      So to calculate the output size:

      • A: 6 occurrences * 2 bits = 12 bits
      • B: 1 occurrence * 3 bits = 3 bits
      • C: 6 occurrences * 2 bits = 12 bits
      • D: 2 occurrences * 3 bits = 6 bits
      • E: 5 occurrences * 2 bits = 10 bits

      Sum of the encoded data is 12+3+12+6+10 = 43 bits

      Add that to the 49 bits from the tree, and the output will be 92 bits, or 12 bytes (rounded up). Compare that to the 20 * 8 = 160 bits (20 bytes) necessary to store the original 20 characters unencoded, and you'll save 8 bytes.

      The final output, with the tree at the start, is as follows. Each character in the stream (A-E) is encoded as 8 bits, whereas each 0 and 1 is just a single bit. The space in the stream is just there to separate the tree from the encoded data and does not take up any space in the final output.

      001A1C01E01B1D 0000000000001100101010101011111111010101010
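The stream above can be reproduced from the path table with a few lines of Python. This is a sketch under assumptions: the bit stream is modelled as a string of '0' and '1' characters for readability, the final byte is padded with zero bits, and the helper name `serialize` is invented for this example.

```python
PATHS = {"A": "00", "B": "110", "C": "01", "D": "111", "E": "10"}
TEXT = "AAAAAABCCCCCCDDEEEEE"


def serialize(paths, prefix=""):
    """Pre-order dump of the tree implied by a complete prefix-code table."""
    for symbol, path in paths.items():
        if path == prefix:
            # Leaf: flag bit 1 followed by the 8-bit character.
            return "1" + format(ord(symbol), "08b")
    # Internal node: flag bit 0, then the left (0) and right (1) subtrees.
    return "0" + serialize(paths, prefix + "0") + serialize(paths, prefix + "1")


tree_bits = serialize(PATHS)
data_bits = "".join(PATHS[ch] for ch in TEXT)

# Pad to a byte boundary and pack, most significant bit first.
stream = tree_bits + data_bits
padded = stream + "0" * (-len(stream) % 8)
packed = int(padded, 2).to_bytes(len(padded) // 8, "big")

print(len(tree_bits), len(data_bits), len(packed))  # 49 43 12
```

Because 92 bits do not fill 12 bytes exactly, the last byte carries 4 padding bits. A decoder therefore needs some way to know where the data ends, for example a stored symbol count; that detail is outside what the answer specifies.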
      


      For the concrete example you have in the comments, AABCDEF, you will get this:

      Input: AABCDEF

      Frequencies:

      • A: 2
      • B: 1
      • C: 1
      • D: 1
      • E: 1
      • F: 1

      Tree:

              7
        -------------
        |           4
        |       ---------
        3       2       2
      -----   -----   -----
      A   B   C   D   E   F
      2   1   1   1   1   1
      

      Paths:

      • A: 00
      • B: 01
      • C: 100
      • D: 101
      • E: 110
      • F: 111

      Tree: 001A1B001C1D01E1F = 59 bits
      Data: 000001100101110111 = 18 bits
      Sum: 59 + 18 = 77 bits = 10 bytes

      Since the original was 7 characters of 8 bits = 56 bits (7 bytes), the tree adds too much overhead for such small pieces of data.
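A check like this can be automated before deciding whether to compress a block at all. The following Python sketch assumes 8-bit symbols and the tree layout described in this answer; `huffman_total_bits` and `worth_compressing` are hypothetical helper names. It relies on the fact that the total encoded length of a Huffman code equals the sum of the weights of all merged nodes, so no explicit tree is needed.

```python
import heapq
from collections import Counter


def huffman_total_bits(data, bits_per_symbol=8):
    """Bits needed for the serialized tree plus the Huffman-coded data."""
    freqs = Counter(data)
    tree_bits = (bits_per_symbol + 2) * len(freqs) - 1
    heap = list(freqs.values())
    heapq.heapify(heap)
    data_bits = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        data_bits += merged  # each merge adds one bit to every symbol below it
        heapq.heappush(heap, merged)
    return tree_bits + data_bits


def worth_compressing(data):
    """True if tree + encoded data is smaller than the raw bytes."""
    if not data:
        return False
    compressed_bytes = (huffman_total_bits(data) + 7) // 8
    return compressed_bytes < len(data)


print(huffman_total_bits(b"AABCDEF"), worth_compressing(b"AABCDEF"))  # 77 False
print(worth_compressing(b"AAAAAABCCCCCCDDEEEEE"))                     # True
```

An encoder working in chunks could store a block uncompressed whenever this returns `False`, at the cost of a flag that tells the decoder which form follows.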
