Hashing a Tree Structure

Problem description

I've just come across a scenario in my project where I need to compare different tree objects for equality with already known instances, and it occurred to me that some sort of hashing algorithm that operates on an arbitrary tree would be very useful.

Take for example the following tree:

        O
       / \
      /   \
     O     O
    /|\    |
   / | \   |
  O  O  O  O
          / \
         /   \
        O     O

Here each O represents a node of the tree, is an arbitrary object, and has an associated hash function. So the problem reduces to: given the hash codes of the nodes of a tree structure, and a known structure, what is a decent algorithm for computing a (relatively) collision-free hash code for the entire tree?

A few notes on the properties of the hash function:

  • The hash function should depend on the hash code of every node within the tree as well as its position.
  • Reordering the children of a node should distinctly change the resulting hash code.
  • Reflecting any part of the tree should distinctly change the resulting hash code.

If it helps, I'm using C# 4.0 here in my project, though I'm primarily looking for a theoretical solution, so pseudo-code, a description, or code in another imperative language would be fine.


UPDATE

Well, here's my own proposed solution. Several of the answers here helped a great deal.

Each node (sub-tree/leaf node) has the following hash function:

public override int GetHashCode()
{
    int hashCode = unchecked((this.Symbol.GetHashCode() * 31 +
        this.Value.GetHashCode()));
    for (int i = 0; i < this.Children.Count; i++)
        hashCode = unchecked(hashCode * 31 + this.Children[i].GetHashCode());
    return hashCode;
}

The nice thing about this method, as I see it, is that hash codes can be cached and only recalculated when the node or one of its descendants changes. (Thanks to vatine and Jason Orendorff for pointing this out).

Anyway, I would be grateful if people could comment on my suggested solution here - if it does the job well, then great, otherwise any possible improvements would be welcome.

Solution

If I were to do this, I'd probably do something like the following:

For each leaf node, compute the concatenation of 0 and the hash of the node data.

For each internal node, compute the concatenation of 1 and the hash of any local data (NB: may not be applicable) and the hash of the children from left to right.

This will lead to a cascade up the tree every time you change anything, but that may be low enough overhead to be worthwhile. If changes are relatively infrequent compared to how often the hash is needed, it may even make sense to go for a cryptographically secure hash.

Edit1: There is also the possibility of adding a "hash valid" flag to each node and simply propagating "false" up the tree on a node change (or a "hash invalid" flag, propagating "true"). That way, a complete recalculation can be avoided when the tree hash is needed, as can multiple hash calculations whose results are never used, at the risk of a slightly less predictable time to get a hash when needed.

Edit3: The hash code suggested by Noldorin in the question looks like it would have a chance of collisions if the result of GetHashCode can ever be 0. Essentially, there is no way of distinguishing a tree composed of a single node with "symbol hash" 30 and "value hash" 25 from a two-node tree where the root has a "symbol hash" of 0 and a "value hash" of 30 and the child node has a total hash of 25: both come out as 30 * 31 + 25 = 955. The examples are entirely invented. I don't know what the expected hash ranges are, so I can only comment on what I see in the presented code.
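The collision described in Edit3 can be reproduced with a short Python sketch. The function `node_hash` is a hypothetical stand-in for the C# `GetHashCode` in the question, with the 32-bit wraparound of `unchecked` emulated by a mask. The second function, `node_hash_with_count`, is not from the question or the answer; it is one possible mitigation, assumed here for illustration, that mixes the number of children into the hash in the same spirit as the 0/1 leaf/internal tag.

```python
MASK = 0xFFFFFFFF  # emulate 32-bit wraparound of C# unchecked arithmetic


def node_hash(symbol_hash, value_hash, child_hashes=()):
    """Python rendering of the combine step from the question's GetHashCode."""
    h = (symbol_hash * 31 + value_hash) & MASK
    for child in child_hashes:
        h = (h * 31 + child) & MASK
    return h


def node_hash_with_count(symbol_hash, value_hash, child_hashes=()):
    """Hypothetical variant: also mix in the child count before the children."""
    h = (symbol_hash * 31 + value_hash) & MASK
    h = (h * 31 + len(child_hashes)) & MASK
    for child in child_hashes:
        h = (h * 31 + child) & MASK
    return h


# Single node: symbol hash 30, value hash 25.
single = node_hash(30, 25)
# Two-node tree: root with symbol hash 0 and value hash 30, child hash 25.
two_node = node_hash(0, 30, [25])

print(single, two_node)  # both are 955, so the two trees collide
print(node_hash_with_count(30, 25) != node_hash_with_count(0, 30, [25]))
```

With the child count mixed in, these two particular trees no longer collide, though this says nothing about collisions in general.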

Using 31 as the multiplicative constant is good, in that it will cause any overflow to happen on a non-bit boundary, although I am thinking that, with sufficient children and possibly adversarial content in the tree, the hash contribution from items hashed early MAY be dominated by later hashed items.
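As a rough illustration of the "non-bit boundary" remark, the sketch below compares the multiplier 31 with the power of two 32 under 32-bit wraparound. The `combine` helper and the sample values are made up for this example. With 32, each step is a 5-bit left shift, so after seven further items the first item has been shifted out of a 32-bit word entirely; with an odd multiplier such as 31 that cannot happen. This only shows that early items are not lost outright and makes no claim about how well they are mixed.

```python
MASK = 0xFFFFFFFF  # emulate 32-bit wraparound


def combine(items, multiplier):
    """Fold a sequence of item hashes using h = h * multiplier + item."""
    h = 0
    for item in items:
        h = (h * multiplier + item) & MASK
    return h


# Two sequences that differ only in their first item, followed by 7 more items.
a = [1] + [5] * 7
b = [2] + [5] * 7

# With 32 the first item is multiplied by 32**7 == 2**35, which is 0 mod 2**32.
same_with_32 = combine(a, 32) == combine(b, 32)
# With 31 the first item is multiplied by 31**7, which is odd and so nonzero.
same_with_31 = combine(a, 31) == combine(b, 31)

print(same_with_32, same_with_31)
```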

However, if the hash performs decently on expected data, it looks as if it will do the job. It's certainly faster than using a cryptographic hash (as done in the example code listed below).

Edit2: As for a specific algorithm and the minimum data structure needed, something like the following would work (Python; translating it to any other language should be relatively easy).

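#!/usr/bin/env python

# hashlib.sha1 from the standard library stands in for the original
# Crypto.Hash.SHA (PyCrypto), which is no longer maintained.
import hashlib

class Node:
    def __init__(self, parent=None, contents="", children=None):
        self.valid = False
        self.hash = None
        self.parent = parent
        self.contents = contents
        # Build a fresh list per node instead of sharing a mutable default.
        self.children = []
        for child in children or []:
            self.append_child(child)

    def append_child(self, child):
        child.parent = self
        self.children.append(child)
        self.invalidate()

    def invalidate(self):
        # Mark this node and all of its ancestors as needing a rehash.
        self.valid = False
        if self.parent:
            self.parent.invalidate()

    def gethash(self):
        # Reuse the cached hash unless this subtree has changed.
        if self.valid:
            return self.hash

        digester = hashlib.sha1()
        digester.update(self.contents.encode("utf-8"))

        if self.children:
            # Internal node: fold in the children's hashes from left to right.
            for child in self.children:
                digester.update(child.gethash().encode("ascii"))
            self.hash = "1" + digester.hexdigest()
        else:
            self.hash = "0" + digester.hexdigest()

        self.valid = True
        return self.hash

    def setcontents(self, contents):
        self.contents = contents
        self.invalidate()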
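To check the properties asked for in the question (child order and mirroring must change the hash), here is a compact, stateless sketch of the same leaf/internal tagging idea. The `(contents, children)` tuple representation and the names `tree_hash` and `leaf` are assumptions made for this illustration, and the tag is fed into the digest rather than prepended to it, which differs slightly from the code above.

```python
import hashlib


def tree_hash(node):
    """Hash a tree given as (contents, children), children being a list of such pairs."""
    contents, children = node
    digester = hashlib.sha1()
    # Tag leaves with 0 and internal nodes with 1.
    digester.update(b"1" if children else b"0")
    # Hash the contents to a fixed length so they cannot blur into child digests.
    digester.update(hashlib.sha1(contents.encode("utf-8")).digest())
    # Fold in the children from left to right, so order matters.
    for child in children:
        digester.update(tree_hash(child))
    return digester.digest()


def leaf(contents):
    return (contents, [])


t1 = ("root", [leaf("a"), leaf("b")])
t2 = ("root", [leaf("b"), leaf("a")])  # same nodes, children mirrored
t3 = ("root", [leaf("a"), leaf("b")])  # structurally identical to t1

print(tree_hash(t1) == tree_hash(t3))  # equal trees hash equally
print(tree_hash(t1) != tree_hash(t2))  # reordering children changes the hash
```

This version recomputes everything on each call; the cached `Node` class is the better fit when the tree changes rarely and the hash is read often.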
