Hierarchical Data Structure Design (Nested Sets)


Problem description

I'm working on a design for a hierarchical database structure which models a catalogue containing products (this is similar to this question). The database platform is SQL Server 2005 and the catalogue is quite large (750,000 products, 8,500 catalogue sections over 4 levels) but is relatively static (reloaded once a day) and so we are only concerned about READ performance.

The general structure of the catalogue hierarchy is:-

  • Level 1 Section
    • Level 2 Section
      • Level 3 Section
        • Level 4 Section (products are linked to here)

We are using the Nested Sets pattern for storing the hierarchy levels and storing the products which exist at that level in a separate linked table. So the simplified database structure would be

CREATE TABLE CatalogueSection
(
    SectionID INTEGER,
    ParentID INTEGER,
    LeftExtent INTEGER,
    RightExtent INTEGER
)

CREATE TABLE CatalogueProduct
(
    ProductID INTEGER,
    SectionID INTEGER
)
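As an illustration of how the nested-set columns are typically read, here is a minimal sketch that finds every product under a section by comparing LeftExtent and RightExtent. It uses Python's built-in sqlite3 module purely so the example is runnable; the sample rows, the helper name products_under and the SQLite dialect are assumptions, and the SQL would need adapting for SQL Server 2005.

```python
import sqlite3

# In-memory SQLite database standing in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CatalogueSection (
    SectionID   INTEGER PRIMARY KEY,
    ParentID    INTEGER,
    LeftExtent  INTEGER,
    RightExtent INTEGER
);
CREATE TABLE CatalogueProduct (
    ProductID INTEGER,
    SectionID INTEGER
);
""")

# Hypothetical six-section tree: 1 -> (2, 3), 3 -> (4, 5), 4 -> 6.
# Extents are numbered by a depth-first walk of that tree.
sections = [
    (1, None, 1, 12),
    (2, 1, 2, 3),
    (3, 1, 4, 11),
    (4, 3, 5, 8),
    (6, 4, 6, 7),
    (5, 3, 9, 10),
]
conn.executemany("INSERT INTO CatalogueSection VALUES (?, ?, ?, ?)", sections)

# Hypothetical products attached to the lowest-level sections.
products = [(100, 2), (101, 6), (102, 6), (103, 5)]
conn.executemany("INSERT INTO CatalogueProduct VALUES (?, ?)", products)
conn.commit()


def products_under(conn, section_id):
    """Return product IDs in a section and all of its child sections."""
    rows = conn.execute(
        """
        SELECT p.ProductID
        FROM CatalogueSection AS root
        JOIN CatalogueSection AS child
          ON child.LeftExtent BETWEEN root.LeftExtent AND root.RightExtent
        JOIN CatalogueProduct AS p
          ON p.SectionID = child.SectionID
        WHERE root.SectionID = ?
        ORDER BY p.ProductID
        """,
        (section_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling products_under(conn, 3) walks sections 3, 4, 5 and 6 in a single range comparison, which is the access pattern the rest of this question is about.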

We do have an added complication in that we have about 1000 separate customer groups which may or may not see all products in the catalogue. Because of this we need to maintain a separate "copy" of the catalogue hierarchy for each customer group so that when they browse the catalogue, they only see their products and they also don't see any sections which are empty.

To facilitate this we maintain a table of the number of products at each level of the hierarchy "rolled up" from the section below. So, even though products are only directly linked to the lowest level of the hierarchy, they are counted all the way up the tree. The structure of this table is

CREATE TABLE CatalogueSectionCount
(
    SectionID INTEGER,
    CustomerGroupID INTEGER,
    SubSectionCount INTEGER,
    ProductCount INTEGER
)

So, onto the problem

Performance is very poor at the top levels of the hierarchy. The general query to show the "top 10" products in the selected catalogue section (and all child sections) is taking somewhere in the region of 1 minute to complete. At lower sections in the hierarchy it is faster but still not good enough.

I've put indexes (including covering indexes where applicable) on all key tables and run the query through the Query Analyzer, the Index Tuning Wizard, etc., but I still cannot get it to perform fast enough.

I'm wondering whether the design is fundamentally flawed or whether it's because we have such a large dataset? We have a reasonable development server (3.8 GHz Xeon, 4 GB RAM) but it's just not working :)

Thanks for any help

James

Solution

Use a closure table. If your basic structure is a parent-child with the fields ID and ParentID, then the structure for a closure table is ID and DescendantID. In other words, a closure table is an ancestor-descendant table, where each possible ancestor is associated with all descendants. You may include a LevelsBetween field if you need. Closure table implementations usually include self-referencing records, i.e. ID 1 is an ancestor of descendant ID 1 with LevelsBetween of zero.

Example:

Parent/Child
ParentID - ID
1 - 2
1 - 3
3 - 4
3 - 5
4 - 6

Ancestor/Descendant
ID - DescendantID - LevelsBetween
1 - 1 - 0
1 - 2 - 1
1 - 3 - 1
1 - 4 - 2
1 - 5 - 2
1 - 6 - 3
2 - 2 - 0
3 - 3 - 0
3 - 4 - 1
3 - 5 - 1
3 - 6 - 2
4 - 4 - 0
4 - 6 - 1
5 - 5 - 0
6 - 6 - 0

The table is intended to eliminate recursive joins. You push the load of the recursive join into an ETL cycle that you do when you load the data once a day. That shifts it away from the query.
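As a rough sketch of what that load-time step could look like, the snippet below expands the six-node parent/child example into ancestor/descendant rows with a recursive query. The table names Section and SectionClosure and the helper build_closure are hypothetical, and SQLite (through Python's sqlite3 module) is used only to keep the example runnable. SQL Server 2005 also supports recursive common table expressions, but the syntax differs slightly (no RECURSIVE keyword, for instance).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical parent/child source table.
CREATE TABLE Section (
    ID       INTEGER PRIMARY KEY,
    ParentID INTEGER
);
-- Hypothetical closure table, rebuilt on every load.
CREATE TABLE SectionClosure (
    ID            INTEGER NOT NULL,
    DescendantID  INTEGER NOT NULL,
    LevelsBetween INTEGER NOT NULL,
    PRIMARY KEY (ID, DescendantID)
);
""")

# The parent/child pairs from the example: 1 -> (2, 3), 3 -> (4, 5), 4 -> 6.
conn.executemany(
    "INSERT INTO Section (ID, ParentID) VALUES (?, ?)",
    [(1, None), (2, 1), (3, 1), (4, 3), (5, 3), (6, 4)],
)


def build_closure(conn):
    """Rebuild SectionClosure from Section in one pass."""
    conn.execute("DELETE FROM SectionClosure")
    conn.execute(
        """
        WITH RECURSIVE walk(ID, DescendantID, LevelsBetween) AS (
            -- Every node is its own ancestor at distance zero.
            SELECT ID, ID, 0 FROM Section
            UNION ALL
            -- Extend each known path by one child.
            SELECT w.ID, s.ID, w.LevelsBetween + 1
            FROM walk AS w
            JOIN Section AS s ON s.ParentID = w.DescendantID
        )
        INSERT INTO SectionClosure (ID, DescendantID, LevelsBetween)
        SELECT ID, DescendantID, LevelsBetween FROM walk
        """
    )
    conn.commit()


build_closure(conn)
closure_rows = conn.execute(
    "SELECT ID, DescendantID, LevelsBetween "
    "FROM SectionClosure ORDER BY ID, DescendantID"
).fetchall()
```

The recursion runs once per load, so read queries afterwards only ever do a plain lookup on SectionClosure.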

Also, it allows variable-level hierarchies. You won't be stuck at 4.

Finally, it allows you to slot products in non-leaf nodes. A lot of catalogs create "Miscellaneous" buckets at higher levels of the hierarchy to create a leaf-node to attach products to. You don't need to do that since intermediate nodes are included in the closure.

As far as indexing goes, I would do a clustered index on ID/DescendantID.

Now for your query performance. The closure table takes a chunk out of the cost, but not all of it. You mentioned a "Top 10". This implies ranking over a set of facts that you haven't mentioned, and we need those details to help tune that part. Plus, the closure table only gets you the leaf-level sections, not the products. At the very least, you should have an index on your CatalogueProduct that orders by SectionID/ProductID. I would force Section-to-Product joins to be loop joins, based on the cardinality you provided. A report on a catalog section would go to the closure table to get descendants (using a clustered index seek). That list of descendants would then be used to get products from CatalogueProduct through looped seeks on that index. Then, with those products, you would get the facts necessary to do the ranking.
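The read path described above might be sketched as follows. This is an illustration under several assumptions: the SectionClosure table, the index name and the helper products_in_section are hypothetical; SQLite stands in for SQL Server (a WITHOUT ROWID primary key is the closest SQLite analogue to a clustered index, and LIMIT replaces TOP); and because the ranking facts are unknown, ORDER BY ProductID is only a placeholder for the real ranking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Closure table keyed on (ID, DescendantID), stored in key order.
CREATE TABLE SectionClosure (
    ID            INTEGER NOT NULL,
    DescendantID  INTEGER NOT NULL,
    LevelsBetween INTEGER NOT NULL,
    PRIMARY KEY (ID, DescendantID)
) WITHOUT ROWID;

CREATE TABLE CatalogueProduct (
    ProductID INTEGER NOT NULL,
    SectionID INTEGER NOT NULL
);
-- Index ordered by SectionID/ProductID, as suggested in the answer.
CREATE INDEX IX_CatalogueProduct_Section
    ON CatalogueProduct (SectionID, ProductID);
""")

# Closure rows for the example tree 1 -> (2, 3), 3 -> (4, 5), 4 -> 6.
conn.executemany(
    "INSERT INTO SectionClosure VALUES (?, ?, ?)",
    [
        (1, 1, 0), (1, 2, 1), (1, 3, 1), (1, 4, 2), (1, 5, 2), (1, 6, 3),
        (2, 2, 0),
        (3, 3, 0), (3, 4, 1), (3, 5, 1), (3, 6, 2),
        (4, 4, 0), (4, 6, 1),
        (5, 5, 0),
        (6, 6, 0),
    ],
)

# Hypothetical products; 104 sits on non-leaf section 3.
conn.executemany(
    "INSERT INTO CatalogueProduct VALUES (?, ?)",
    [(100, 2), (101, 6), (102, 6), (103, 5), (104, 3)],
)
conn.commit()


def products_in_section(conn, section_id, limit=10):
    """Return up to `limit` product IDs in a section and its descendants."""
    rows = conn.execute(
        """
        SELECT p.ProductID
        FROM SectionClosure AS c
        JOIN CatalogueProduct AS p
          ON p.SectionID = c.DescendantID
        WHERE c.ID = ?
        ORDER BY p.ProductID   -- placeholder for the real ranking
        LIMIT ?
        """,
        (section_id, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

Because section 3 is its own descendant at distance zero, product 104 is returned for section 3 without needing a separate "Miscellaneous" leaf section.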
