将列表映射到霍夫曼树,同时保留相对顺序 [英] Mapping a list to a Huffman Tree whilst preserving relative order

查看:86
本文介绍了将列表映射到霍夫曼树,同时保留相对顺序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在霍夫曼树上的搜索算法遇到问题:对于给定的概率分布,无论输入数据如何排列,我都需要霍夫曼树是相同的。



以下是发生的情况与我想要的情况的图片:





基本上我想知道是否可以保留列表中与树之间的相对顺序。如果不是,为什么会这样呢?



作为参考,我使用霍夫曼树根据概率划分来生成子组,以便我可以运行下面的search()过程。注意,merge()子例程中的数据与权重结合在一起。代码字本身并不像树那么重要(应该保留相对顺序)。



例如,如果我生成以下霍夫曼代码:

 概率= [0.30,0.25,0.20,0.15,0.10] 
项目= ['a','b','c' ,'d','e']
项目= zip(项目,概率)
t =编码(项目)
d,l = hi.search(t)
print(d )

使用以下类别:

 类节点(对象):左
=无
右=无
权重=无
数据=无
代码=无

def __init __(self,w,d):
self.weight = w
self.data = d

def set_children(self,ln ,rn):
self.left = ln
self.right = rn

def __repr __(self):
return [%s,%s,( %s),(%s)]%(self.data,self.code,self.left,self.right)

def __cmp __(self,a):
return cmp (自我体重,体重)

def merge(self,other):
total_freq = self.weight + other.weight
new_data = self.data + other.data
return Node( total_freq,new_data)

def索引(自身,节点):
返回node.weight

def编码(symbfreq):
pdb.set_trace( )
tree = [wt,symbfreq中的sym的Node(sym,wt)]
heapify(tree)
而len(tree)> ;:
lo,hi = heappop(树),heappop(树)
n = lo.merge(hi)
n.set_children(lo,hi)
heappush(tree,n)
tree = tree [ 0]

def Assign_code(节点,代码):
如果节点不是None:
node.code =代码
如果isinstance(节点,节点):
Assign_code(node.left,code +'0')
Assign_code(node.right,code +'1')

Assign_code(tree,'')
返回树

我得到:

 'a'-> 11 
'b'-> 0 1
'c'-> 00
'd'-> 101
'e'-> 100

但是,我在搜索算法中做出的一个假设是,更多可能的项目被推向左侧:也就是说,我需要'a'来拥有'00'码字-并且无论 abcde序列是否进行任何排列,都应该始终如此。示例输出为:

  codewords = {'a':'00','b':'01','c ':'10','d':'110','e':111'} 

(即使'c'的代码字是'd'的后缀,Nb也可以)。



为完整起见,以下是搜索算法:

  def search (树):
打印(树)
pdb.set_trace()
当前= tree.left
other = tree.right
循环= 0
while current:
循环+ = 1
print(current)
如果current.data!= 0并且current不是None且other不是None:
previous = current
当前=当前。左
其他=上一个。右
其他:
前一个=其他
当前=其他。左
其他=其他。右
返回上一个,循环

通过在0和1的组中搜索最左边的 1来工作-霍夫曼树必须在左侧放置更多可能的项目。例如,如果我使用上面的概率和输入:

  items = [1,0,1,0,0] 

然后算法返回的项目索引为2-这不是应该返回的内容(应该是0,因为它在最左边)。

解决方案

我将充实Mark Adler在工作代码中所说的内容。他说的一切都是正确的:-)要点:


  1. 您不得使用浮点加权或任何其他丢失信息的方案关于重量。使用整数。简单正确。例如,如果您有3位数的浮动概率,则可以通过 int(round(the_probability * 1000))将它们转换为整数,然后可能会摆弄它们以确保总和为恰好是1000。

  2. heapq 堆不是稳定的:如果多个项目具有相同的最小值,则不会弹出有关哪个项目的定义重量。

  3. 因此, 建造树时您将无法获得想要的东西。

  4. 规范霍夫曼代码似乎正是您想要的。为此构造树是一个漫长的过程,但是每个步骤都非常简单。丢弃的第一棵树被丢弃:从它那里获得的唯一信息是分配给每个符号的代码的 lengths

运行:

  syms = ['a','b','c','d', 'e'] 
权重= [30、25、20、15、10]
t =编码(符号,权重)
打印t

打印此文件(出于可读性考虑而格式化):

  [ abcde,
([ab,0,
([a,00,(无),(无)])),
([b,01,(无),(无) ])]),
([cde,1,
([c,10,(None),(None)])),
([de,11,
( [d,110,(None),(None)]),
([e,111,(None),(None)]))])))]

据我所知,这正是您想要的。如果不是,请投诉;-)



编辑:规范代码分配中有一个错误,没有出现除非重量差异很大;

  class节点(对象):
def __init __(self,data = None,weight = None,
left = None,right = None,
code = None):
self.data =数据
self.weight =权重
self.left =左
self.right =正确
self.code =代码

def is_symbol(self):
return self.left是self.right是None

def __repr __(self):
返回 [[%s,%s,(%s),(%s)]%(self.data,
self.code,
self.left,
self.right)

def __cmp __(self,a):
return cmp(self.weight,a.weight)

def编码(符号,权重):从heapq导入
导入heapify,heappush,heappop

树= [Node(data = s,weight = w)
for s,w in zip(syms,weights)]
sym2node = {s.data:s fo树中的r}}
heapify(tree)
而len(tree)> 1:
a,b = heappop(树),heappop(树)
heappush(tree,Node(weight = a.weight + b.weight,
left = a,right = b) )

#计算规范编码的代码长度。
sym2len = {}
def Assign_codelen(node,codelen):
,如果节点不为空:
,如果node.is_symbol():
sym2len [node.data ] = codelen
else:
Assign_codelen(node.left,codelen + 1)
Assign_codelen(node.right,codelen + 1)
Assign_codelen(tree [0],0)

#创建规范代码,但要有所不同:
#按字母顺序对符号进行排序,按
#在符号列表中的位置对其进行排序。
#构造一个(codelen,index,symbol)三元组的列表。
#ʻindex`会中断关系,因此具有相同
#代码长度的符号将保留其原始顺序。
三元组= [(sym2len [name],i,name)
for i,enumerate(syms)中的名称]
code = oldcodelen = 0
for codelen,_,name in sorted(triples):
如果codelen> oldcodelen:
代码< ==(codelen-oldcodelen)
sym2node [name] .code = format(code, 0%db%codelen)
代码+ = 1
oldcodelen = codelen

#创建一个与新代码相对应的树。
树= Node(code =)
dir2attr = { 0:左, 1:右}
for sym2node.values()中的snode: b $ b scode = snode.code
codesofar =
parent = tree
#遍历树,创建任何所需的内部节点。
代表s代码中的d:
断言父母不是无人
codesofar + = d
attr = dir2attr [d]
child = getattr(parent,attr)
if codesofar == scode:
#我们在叶子位置。
断言子项为None
setattr(父,attr,snode)
elif子项不为None:
断言child.code == codesofar
else:
child = Node(code = codesofar)
setattr(parent,attr,child)
parent = child

#最后,将data属性粘贴到$ b $中b#树。为什么?不知道;-)
def paste(node):
如果节点为None:
return
elif node.is_symbol():
返回节点.data
else:
结果= paste(node.left)+ paste(node.right)
node.data =结果
返回结果
paste(tree)

返回树



重复符号




我可以将sym2node dict换成orderdict来处理
重复的'a'/'b's等吗?


否,有两个原因:


  1. 没有映射类型支持重复键;并且,

  2. 重复符号的概念对于霍夫曼编码没有意义。

因此,如果您确定;-)进行此操作,则首先必须确保符号是唯一的。只需在函数的开头添加以下行:

  syms = list(enumerate(syms))

例如,如果传入的 syms 是:

  ['a','b','a'] 

将会更改为:

  [(0,'a'), (1,'b'),(2,'a')] 

所有符号现在都是2 -tuple,并且显然是唯一的,因为每个都以唯一的整数开头。 算法唯一关心的是符号可以用作字典键。不管它们是字符串,元组还是支持相等性测试的任何其他可哈希类型,都可以忽略。



因此,算法中的任何内容都无需更改。但在最后之前,我们要恢复原始符号。只需在 paste()函数之前插入它即可:

  def restore_syms( node):
如果node为None:
返回
elif node.is_symbol():
node.data = node.data [1]
else:
restore_syms(node.left)
restore_syms(node.right)
restore_syms(tree)

这只是简单地走树并从符号的 .data 成员中除去前导整数。或者,也许更简单,只需遍历 sym2node.values(),然后转换每个成员的 .data 成员。 / p>

I'm having an issue with a search algorithm over a Huffman tree: for a given probability distribution I need the Huffman tree to be identical regardless of permutations of the input data.

Here is a picture of what's happening vs what I want:

Basically I want to know if it's possible to preserve the relative order of the items from the list to the tree. If not, why is that so?

For reference, I'm using the Huffman tree to generate sub groups according to a division of probability, so that I can run the search() procedure below. Notice that the data in the merge() sub-routine is combined, along with the weight. The codewords themselves aren't as important as the tree (which should preserve the relative order).

For example if I generate the following Huffman codes:

probabilities = [0.30, 0.25, 0.20, 0.15, 0.10]
items = ['a','b','c','d','e']
items = zip(items, probabilities)
t = encode(items)
d,l = hi.search(t)
print(d)

Using the following Class:

class Node(object):
    left = None
    right = None
    weight = None
    data = None
    code = None

    def __init__(self, w,d):
        self.weight = w
        self.data = d

    def set_children(self, ln, rn):
        self.left = ln
        self.right = rn

    def __repr__(self):
        return "[%s,%s,(%s),(%s)]" %(self.data,self.code,self.left,self.right)

    def __cmp__(self, a):
        return cmp(self.weight, a.weight)

    def merge(self, other):
        total_freq = self.weight + other.weight
        new_data = self.data + other.data
        return Node(total_freq,new_data)

    def index(self, node):
        return node.weight

def encode(symbfreq):
    pdb.set_trace()
    tree = [Node(sym,wt) for wt,sym in symbfreq]
    heapify(tree)
    while len(tree)>1:
        lo, hi = heappop(tree), heappop(tree)
        n = lo.merge(hi)
        n.set_children(lo, hi)
        heappush(tree, n)
    tree = tree[0]

    def assign_code(node, code):
        if node is not None:
            node.code = code
        if isinstance(node, Node):
            assign_code(node.left, code+'0')
            assign_code(node.right, code+'1')

    assign_code(tree, '')
    return tree

I get:

'a'->11
'b'->01
'c'->00
'd'->101
'e'->100

However, an assumption I've made in the search algorithm is that more probable items get pushed toward the left: that is I need 'a' to have the '00' codeword - and this should always be the case regardless of any permutation of the 'abcde' sequence. An example output is:

codewords = {'a':'00', 'b':'01', 'c':'10', 'd':'110', 'e':111'}

(N.b even though the codeword for 'c' is a suffix for 'd' this is ok).

For completeness, here is the search algorithm:

def search(tree):
    print(tree)
    pdb.set_trace()
    current = tree.left
    other = tree.right
    loops = 0
    while current:
        loops+=1
        print(current)
        if current.data != 0 and current is not None and other is not None:
            previous = current
            current = current.left
            other = previous.right
        else:
            previous = other
            current = other.left
            other = other.right
    return previous, loops

It works by searching for the 'leftmost' 1 in a group of 0s and 1s - the Huffman tree has to put more probable items on the left. For example if I use the probabilities above and the input:

items = [1,0,1,0,0]

Then the index of the item returned by the algorithm is 2 - which isn't what should be returned (0 should, as it's leftmost).

解决方案

I'll flesh out what Mark Adler said with working code. Everything he said is right :-) The high points:

  1. You must not use floating-point weights, or any other scheme that loses information about weights. Use integers. Simple and correct. If, e.g., you have 3-digit floating probabilities, convert each to an integer via int(round(the_probability * 1000)), then maybe fiddle them to ensure the sum is exactly 1000.
  2. heapq heaps are not "stable": nothing is defined about which item is popped if multiple items have the same minimal weight.
  3. So you can't get what you want while building the tree.
  4. A small variation of "canonical Huffman codes" appears to be what you do want. Constructing a tree for that is a long-winded process, but each step is straightforward enough. The first tree built is thrown away: the only information taken from it is the lengths of the codes assigned to each symbol.

Running:

syms = ['a','b','c','d','e']
weights = [30, 25, 20, 15, 10]
t = encode(syms, weights)
print t

prints this (formatted for readability):

[abcde,,
    ([ab,0,
        ([a,00,(None),(None)]),
        ([b,01,(None),(None)])]),
    ([cde,1,
        ([c,10,(None),(None)]),
        ([de,11,
            ([d,110,(None),(None)]),
            ([e,111,(None),(None)])])])]

Best I understand, that's exactly what you want. Complain if it isn't ;-)

EDIT: there was a bug in the assignment of canonical codes, which didn't show up unless weights were very different; fixed it.

class Node(object):
    def __init__(self, data=None, weight=None,
                       left=None, right=None,
                       code=None):
        self.data = data
        self.weight = weight
        self.left = left
        self.right = right
        self.code = code

    def is_symbol(self):
        return self.left is self.right is None

    def __repr__(self):
        return "[%s,%s,(%s),(%s)]" % (self.data,
                                      self.code,
                                      self.left,
                                      self.right)

    def __cmp__(self, a):
        return cmp(self.weight, a.weight)

def encode(syms, weights):
    from heapq import heapify, heappush, heappop

    tree = [Node(data=s, weight=w)
            for s, w in zip(syms, weights)]
    sym2node = {s.data: s for s in tree}
    heapify(tree)
    while len(tree) > 1:
        a, b = heappop(tree), heappop(tree)
        heappush(tree, Node(weight=a.weight + b.weight,
                            left=a, right=b))

    # Compute code lengths for the canonical coding.
    sym2len = {}
    def assign_codelen(node, codelen):
        if node is not None:
            if node.is_symbol():
                sym2len[node.data] = codelen
            else:
                assign_codelen(node.left, codelen + 1)
                assign_codelen(node.right, codelen + 1)
    assign_codelen(tree[0], 0)

    # Create canonical codes, but with a twist:  instead
    # of ordering symbols alphabetically, order them by
    # their position in the `syms` list.
    # Construct a list of (codelen, index, symbol) triples.
    # `index` breaks ties so that symbols with the same
    # code length retain their original ordering.
    triples = [(sym2len[name], i, name)
                for i, name in enumerate(syms)]
    code = oldcodelen = 0
    for codelen, _, name in sorted(triples):
        if codelen > oldcodelen:
            code <<= (codelen - oldcodelen)
        sym2node[name].code = format(code, "0%db" % codelen)
        code += 1
        oldcodelen = codelen

    # Create a tree corresponding to the new codes.
    tree = Node(code="")
    dir2attr = {"0": "left", "1": "right"}
    for snode in sym2node.values():
        scode = snode.code
        codesofar = ""
        parent = tree
        # Walk the tree creating any needed interior nodes.
        for d in scode:
            assert parent is not None
            codesofar += d
            attr = dir2attr[d]
            child = getattr(parent, attr)
            if codesofar == scode:
                # We're at the leaf position.
                assert child is None
                setattr(parent, attr, snode)
            elif child is not None:
                assert child.code == codesofar
            else:
                child = Node(code=codesofar)
                setattr(parent, attr, child)
            parent = child

    # Finally, paste the `data` attributes together up
    # the tree.  Why?  Don't know ;-)
    def paste(node):
        if node is None:
            return ""
        elif node.is_symbol():
            return node.data
        else:
            result = paste(node.left) + paste(node.right)
            node.data = result
            return result
    paste(tree)

    return tree

Duplicate symbols

Could I swap the sym2node dict to an ordereddict to deal with repeated 'a'/'b's etc?

No, and for two reasons:

  1. No mapping type supports duplicate keys; and,
  2. The concept of "duplicate symbols" makes no sense for Huffman encoding.

So, if you're determined ;-) to pursue this, first you have to ensure that symbols are unique. Just add this line at the start of the function:

        syms = list(enumerate(syms))

For example, if the syms passed in is:

['a', 'b', 'a']

that will change to:

[(0, 'a'), (1, 'b'), (2, 'a')]

All symbols are now 2-tuples, and are obviously unique since each starts with a unique integer. The only thing the algorithm cares about is that symbols can be used as dict keys; it couldn't care less whether they're strings, tuples, or any other hashable type that supports equality testing.

So nothing in the algorithm needs to change. But before the end, we'll want to restore the original symbols. Just insert this before the paste() function:

    def restore_syms(node):
        if node is None:
            return
        elif node.is_symbol():
            node.data = node.data[1]
        else:
            restore_syms(node.left)
            restore_syms(node.right)
    restore_syms(tree)

That simply walks the tree and strips the leading integers off the symbols' .data members. Or, perhaps simpler, just iterate over sym2node.values(), and transform the .data member of each.

这篇关于将列表映射到霍夫曼树,同时保留相对顺序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆