高效的算法比较XML节点 [英] Efficient algorithm for comparing XML nodes

查看:180
本文介绍了高效的算法比较XML节点的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我要确定一个XML文档中两个不同的子节点是否相等或不。两个节点应该被认为是相等的,如果他们有相同的属性集和儿童笔记和所有子笔记相等,太(即整个子树应该是平等的)。

I want to determine whether two different child nodes within an XML document are equal or not. Two nodes should be considered equal if they have the same set of attributes and child notes and all child notes are equal, too (i.e. the whole sub tree should be equal).

输入文件可能会很大(高达60MB,超过100000节点比较)以及性能是一个问题。

The input document might be very large (up to 60MB, more than a 100000 nodes to compare) and performance is an issue.

这将是一个有效的方法来检查两个节点的平等?

What would be an efficient way to check for the equality of two nodes?

示例:

<w:p>
  <w:pPr>
    <w:spacing w:after="120"/>
  </w:pPr>
  <w:r>
    <w:t>Hello</w:t>
  </w:r>
</w:p>
<w:p>
  <w:pPr>
    <w:spacing w:after="240"/>
  </w:pPr>
  <w:r>
    <w:t>World</w:t>
  </w:r>
</w:p>

此XML片段描述了OpenXML文档中的段落。算法将被用来确定文档是否包含第(瓦特:对节点)具有相同属性(重量:PPR节点)。作为另一段落说明文件之前

This XML snippet describes paragraphs in an OpenXML document. The algorithm would be used to determine whether a document contains a paragraph (w:p node) with the same properties (w:pPr node) as another paragraph earlier in the document.

一个想法我会为节点的外部XML存储在一个散集(通常情况下我会得到一个规范的字符串重新presentation第一,其中的属性和子笔记以同样的方式总是排序,但我希望我的节点已经是在这样的形式)。

One idea I have would be to store the nodes' outer XML in a hash set (Normally I would have to get a canonical string representation first where attributes and child notes are sorted always in the same way, but I can expect my nodes already to be in such a form).

另一个想法是为每个节点创建一个XmlNode对象,写一个比较器,其比较所有属性和子节点。

Another idea would be to create an XmlNode object for each node and write a comparer which compares all attributes and child nodes.

我的环境是C#(.NET 2.0);任何反馈,进一步的想法是非常值得欢迎的。也许有人甚至有已经是一个很好的解决方案?

My environment is C# (.Net 2.0); any feedback and further ideas are very welcome. Maybe somebody even has already a good solution?

编辑:微软的xmldiff API实际上可以这样做,但我不知道是否会有一个更轻量级的方法。的xmldiff似乎总是产生一个DiffGram,并总是产生重presentation规范节点首先,这两个东西我不需要。

Microsoft's XmlDiff API can actually do that but I was wondering whether there would be a more lightweight approach. XmlDiff seems to always produce a diffgram and to always produce a canonical node representation first, both things which I don't need.

EDIT2:我终于实现我自己的XmlNodeEqualityComparer基于此提出的建议。非常感谢!!!!

I finally implemented my own XmlNodeEqualityComparer based on the suggestion made here. Thanks a lot!!!!

谢谢, DIVO

推荐答案

我会建议不要使用您自己的哈希生成功能,而是依靠内置 XNodeEqualityComparer GetHash code 方法。这保证顾及属性和子代节点产生的结果,可以为您节省一些时间时也。

I'd recommend against rolling your own hash creation function and instead rely on the in-built XNodeEqualityComparer's GetHashCode method. This guarantees to take account of attributes and descendant nodes when creating the result and could save you some time too.

您code将如下所示:

Your code would look like the following:

XNodeEqualityComparer comparer = new XNodeEqualityComparer();
XDocument doc = XDocument.Load("XmlFile1.xml");
Dictionary<int, XNode> nodeDictionary = new Dictionary<int, XNode>();

foreach (XNode node in doc.Elements("doc").Elements("node"))
{
    int hash = comparer.GetHashCode(node);
    if (nodeDictionary.ContainsKey(hash))
    {
        // A duplicate has been found. Execute your logic here
        // ...
    }
    else
    {
        nodeDictionary.Add(hash, node);
    }
}

我的XmlFile1.xml是:

My XmlFile1.xml is:

<?xml version="1.0" encoding="utf-8" ?>
<doc>
  <node att="A">Blah</node>
  <node att="A">Blah</node>
  <node att="B">
    <inner>Innertext</inner>
  </node>
  <node>Blah</node>
  <node att="B">
    <inner>Different</inner>
  </node>
</doc>

nodeDictionary 最终将包含节点及其哈希的一个独特的收藏。重复是通过使用词典的containsKey 方法,传入节点的哈希值,我们生成使用检测在 XNodeEqualityComparer GetHash code 方法。

nodeDictionary will end up containing a unique collection of Nodes and their hashes. Duplicates are detected by using the Dictionary's ContainsKey method, passing in the hash of the node, which we generate using the XNodeEqualityComparer's GetHashCode method.

我想这应该是速度不够快,满足您的需求。

I think this should be fast enough for your needs.

这篇关于高效的算法比较XML节点的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆