哈夫曼树中的唯一标识符节点 [英] Unique Identifiers Nodes in a Huffman Tree

查看:133
本文介绍了哈夫曼树中的唯一标识符节点的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在构建一个Python程序来使用霍夫曼树压缩/解压缩文本文件.以前,我将频率表与.json文件一起存储在压缩文件的旁边.当我读入压缩数据和.json时,我将从频率表中重建解压缩树.我认为这是一个非常有说服力的解决方案.

I'm building a Python program to compress/decompress a text file using a Huffman tree. Previously, I would store the frequency table a .json file alongside the compressed file. When I read in the compressed data and .json, I would rebuild the decompression tree from the frequency table. I thought this was a pretty eloquent solution.

但是,我遇到了一个中等长度文件的奇怪问题,其中它们会解压缩为看似随机字符的字符串.我发现,当两个字符出现相同次数时,就会发生此问题.当我重建树时,频率匹配的那些字符中的任何一个都有被交换的机会.对于大多数文件,特别是大文件和小文件,这不是问题.大多数字母比其他字母略多或略少.但是对于某些中等大小的文件,很大一部分字符与另一个字符出现的次数相同,从而导致乱码.

However, I was running into an odd issue with files of medium length where they would decompress into strings of seemingly random characters. I found that the issue occurred when two character where occurring the same number of times. When I rebuilt my tree, any of those characters with matching frequencies would have the chance of getting swapped. For the majority of files, particularly large and small files, this wasn't a problem. Most letter occurred slightly more or slightly less than others. But for some medium sized files, a large portion of the characters occurred the same number of times as another character resulting in gibberish.

我的节点上是否有唯一的标识符,可用来轻松地重建我的树?还是我应该完全不同地对待树的写作?

Is there a unique identifier for my nodes that I can use instead to easily rebuild my tree? Or should I be approaching the tree writing completely differently?

推荐答案

  1. 在霍夫曼算法中,您需要以确定性的方式选择最低的两个频率,这在两侧都是相同的.如果有平局,则需要使用符号来打破平局.否则,您将无法保证当面对相同的频率时,两侧的排序都会选择相同的符号.

  1. In the Huffman algorithm you need to pick the lowest two frequencies in a deterministic way that is the same on both sides. If there is a tie, you need to use the symbol to break the tie. Without that, you have no assurance that the sorting on both sides will pick the same symbols when faced with equal frequencies.

您不需要发送频率.您只需要发送符号的位长即可.可以比频率更紧凑地编码长度.您可以仅使用长度来构建规范代码,并使用符号明确地对代码进行排序.

You don't need to send the frequencies. All you need to send is the bit lengths for the symbols. The lengths can be coded much more compactly than the frequencies. You can build a canonical code from just the lengths, using the symbols to order the codes unambiguously.

这篇关于哈夫曼树中的唯一标识符节点的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆