匹配嵌套XML数据与数据库表结构的最快方法 [英] Quickest method for matching nested XML data against database table structure

查看:159
本文介绍了匹配嵌套XML数据与数据库表结构的最快方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个应用程序创建 datarequests ,这可能非常复杂。这些需要作为表格存储在数据库中。一个 datarequest (作为XML)的大纲将是...

 < datarequest> 
< datatask view =vw_ContractDatadb =reportingindex =1>
< datefilter modifier =w0>
< filter index =1datatype =dcolumn =合约日期param1 =2009-10-19 12:00:00param2 =2012-09-27 12:00:00 daterange =operation =Between/>
< / datefilter>
< filters>
< alternateation index =1>
< filter index =1datatype =tcolumn =Departmentparam1 =Stockparam2 =operation =Equals/>
< / alternation>
< alternateation index =2>
< filter index =1datatype =tcolumn =Departmentparam1 =HRparam2 =operation =Equals/>
< / alternation>
< / filters>
< series column =Turnaroundaggregate =avgsplit =0splitfield =index =1>
< filters />
< / series>
< series column =Requested 3aggregate =avgsplit =0splitfield =index =2>
< filters>
< alternateation index =1>
< filter index =1datatype =tcolumn =Workerparam1 =Malcomparam2 =operation =Equals/>
< / alternation>
< / filters>
< / series>
< series column =Requested 2aggregate =avgsplit =0splitfield =index =3>
< filters />
< / series>
< series column =Reqestedaggregate =avgsplit =0splitfield =index =4>
< filters />
< / series>
< / datatask>
< / datarequest>

它对包含daterange,主要过滤器,系列和系列过滤器的数据表进行编码。基本上,具有索引属性的任何元素可能会在其父元素中发生多次 - 这个例外是过滤器 datefilter 中。



但是这种结构是一种学术性,问题更重要:



当请求出现时,像这样的XML将作为存储过程的参数发送到SQLServer。该XML被切割成一个去规范化的表,然后迭代地写入归一化表,例如 tblDataRequest(DataRequestID PK) tblDataTask tblFilter tblSeries 。这是很好的。



当我想要将一个给定的XML定义与已经存在于DB中的一个XML定义匹配时,就会出现此问题。我现在这样做...




  • 将XML切成一个去规范化的表格

  • 使用CTE将数据库中的所有现有数据拉到相同的解规范化形式

  • 使用巨大的 WHERE 条件(34行长)



..这将返回任何与给定的XML完全匹配的DataRequestID。我担心这种方法最终会变得很慢 - 部分原因是我不相信CTE会做出任何明智的过滤,它会在之前每次使用所有数据应用巨大的 WHERE



我以为必须有更好的解决方案,例如




  • 当存储 datarequest 时,还会以某种方式存储datarequest的哈希,并且只需匹配。在碰撞的情况下,使用当前的方法。我想要使​​用设置逻辑来做到这一点。而且,我担心XML中的不相关的小差异会改变散列 - 空格等。

  • 从下到上地反复执行匹配。例如,产生一个最低级别匹配的过滤器列表。使用它作为 IN 的一部分来匹配系列。使用它作为 IN 的一部分,以匹配DataTasks等。麻烦的是,当我想到这个时间太长时,我开始熄灭。



基本上 - 有人曾经遇到过这样的问题(他们一定有)。那么推荐的路由是什么?示例(伪)代码将会很好:)

解决方案

为了摆脱轻微差异的可能性,我将通过XML转换(XSLT)运行请求。



或者,由于您已经有代码将其解析成非正规化分期表,这样做也不错。然后,我将简单地使用 FOR XML 来创建一个新的XML文档。 / p>

您的目标是创建一个标准化的XML文档,在适当的情况下遵守排序,并删除不一致的不一致。



一旦完成,将其存储在新表中。现在,您可以直接比较标准化请求XML与现有数据。



要进行实际比较,可以使用哈希,将XML存储为字符串并进行直接字符串比较,或者进行如下完整的XML比较: http://beyondrelational.com/modules/2/blogs/28/posts/10317/xquery- lab-36-writing-a-tsql-function-to-compare-two-xml-values-part-2.aspx



我的偏好只要XML永远不会超过8000字节,将创建一个唯一的字符串(VARCHAR(8000)或NVARCHAR(4000),如果您有特殊字符支持),并在列上创建一个唯一的索引。


I have an application which creates datarequests which can be quite complex. These need to be stored in the database as tables. An outline of a datarequest (as XML) would be...

<datarequest>
  <datatask view="vw_ContractData" db="reporting" index="1">
    <datefilter modifier="w0">
      <filter index="1" datatype="d" column="Contract Date" param1="2009-10-19 12:00:00" param2="2012-09-27 12:00:00" daterange="" operation="Between" />
    </datefilter>
    <filters>
      <alternation index="1">
        <filter index="1" datatype="t" column="Department" param1="Stock" param2="" operation="Equals" />
      </alternation>
      <alternation index="2">
        <filter index="1" datatype="t" column="Department" param1="HR" param2="" operation="Equals" />
      </alternation>
      </filters>
    <series column="Turnaround" aggregate="avg" split="0" splitfield="" index="1">
      <filters />
    </series>
    <series column="Requested 3" aggregate="avg" split="0" splitfield="" index="2">
      <filters>
        <alternation index="1">
          <filter index="1" datatype="t" column="Worker" param1="Malcom" param2="" operation="Equals" />
        </alternation>          
      </filters>
    </series>
    <series column="Requested 2" aggregate="avg" split="0"  splitfield="" index="3">
      <filters />
    </series>
    <series column="Reqested" aggregate="avg" split="0" splitfield="" index="4">
      <filters />
    </series>
  </datatask>
</datarequest>

This encodes a datarequest comprising a daterange, main filters, series and series filters. Basically any element which has the index attribute can occur multiple times within its parent element - the exception to this being the filter within datefilter.

But the structure of this is kind of academic, the problem is more fundamental:

When a request comes through, XML like this is sent to SQLServer as a parameter to a stored proc. This XML is shredded into a de-normalised table and then written iteratively to normalised tables such as tblDataRequest (DataRequestID PK), tblDataTask, tblFilter, tblSeries. This is fine.

The problem occurs when I want to match a given XML defintion with one already held in the DB. I currently do this by...

  • Shredding the XML into a de-normalised table
  • Using a CTE to pull all the existing data in the database into that same de-normalised form
  • Matching using a huge WHERE condition (34 lines long)

..This will return me any DataRequestID which exactly matches the XML given. I fear that this method will end up being painfully slow - partly because I don't believe the CTE will do any clever filtering, it will pull all the data every single time before applying the huge WHERE.

I have thought there must be better solutions to this eg

  • When storing a datarequest, also store a hash of the datarequest somehow and simply match on that. In the case of collision, use the current method. I wanted however to do this using set-logic. And also, I'm concerned about irrelevant small differences in the XML changing the hash - spurious spaces etc.
  • Somehow perform the matching iteratively from the bottom up. Eg produce a list of filters which match on the lowest level. Use this as part of an IN to match Series. Use this as part of an IN to match DataTasks etc etc. The trouble is, I start to black-out when I think about this for too long.

Basically - Has anyone ever encountered this kind of problem before (they must have). And what would be the recommended route for tackling it? example (pseudo)code would be great :)

解决方案

To get rid of the possibility of minor variances, I'd run the request through an XML transform (XSLT).

Alternatively, since you've already got the code to parse this out into a denormalized staging table that's fine too. I would then simply using FOR XML to create a new XML doc.

Your goal here is to create a standardized XML document that respects ordering where appropriate and removes inconsistencies where it is not.

Once that is done, store this in a new table. Now you can run a direct comparison of the "standardized" request XML against existing data.

To do the actual comparison, you can use a hash, store the XML as a string and do a direct string comparison, or do a full XML comparison like this: http://beyondrelational.com/modules/2/blogs/28/posts/10317/xquery-lab-36-writing-a-tsql-function-to-compare-two-xml-values-part-2.aspx

My preference, as long as the XML is never over 8000bytes, would be to create a unique string (either VARCHAR(8000) or NVARCHAR(4000) if you have special character support) and create a unique index on the column.

这篇关于匹配嵌套XML数据与数据库表结构的最快方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆