How can we deal with intersection tables that quickly grow very large?


Question

For example, we have table A and table B, which have a many-to-many relationship. An intersection table, table C, stores A.id and B.id along with a value that represents the relationship between the two. As a concrete example, imagine Stack Exchange, which has a user account, a forum, and a karma score; or a student, a course, and a grade. If tables A and B are very large, table C can and probably will grow monstrously large very quickly (in fact, let's just assume it does). How do we go about dealing with such an issue? Is there a better way to design the tables to avoid this?

Recommended answer

There is no magic. If some rows are connected and some aren't, this information has to be represented somehow, and the "relational" way of doing it is a "junction" (aka "link") table. Yes, a junction table can grow large, but fortunately databases are very capable of handling huge amounts of data.

There are good reasons for using a junction table rather than a comma-separated list (or similar), including:


  • Efficient querying (through indexing and clustering).
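The querying difference can be illustrated with a small sketch. It uses Python's built-in sqlite3 module purely as a convenient stand-in for a full DBMS, and the table and index names (parent_csv, junction, junction_child) are hypothetical. Finding all parents of a given child forces a full scan of the comma-separated column, while the junction table can answer the same question through an index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Anti-pattern: child ids packed into one text column.
    CREATE TABLE parent_csv (parent_id INTEGER PRIMARY KEY, child_ids TEXT);
    INSERT INTO parent_csv VALUES (1, '10,20'), (2, '20,30');

    -- Junction table: one row per (parent, child) pair.
    CREATE TABLE junction (
        parent_id INTEGER NOT NULL,
        child_id  INTEGER NOT NULL,
        PRIMARY KEY (parent_id, child_id)
    );
    CREATE INDEX junction_child ON junction (child_id, parent_id);
    INSERT INTO junction VALUES (1, 10), (1, 20), (2, 20), (2, 30);
""")


def plan(sql, args=()):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, args).fetchall()
    return " | ".join(row[3] for row in rows)


# Which parents are linked to child 20?
csv_sql = ("SELECT parent_id FROM parent_csv "
           "WHERE ',' || child_ids || ',' LIKE '%,' || ? || ',%'")
jct_sql = "SELECT parent_id FROM junction WHERE child_id = ? ORDER BY parent_id"

csv_parents = sorted(r[0] for r in conn.execute(csv_sql, (20,)))
jct_parents = [r[0] for r in conn.execute(jct_sql, (20,))]

csv_plan = plan(csv_sql, (20,))  # full table scan: no index can help the LIKE
jct_plan = plan(jct_sql, (20,))  # index search on junction_child

print(csv_parents, csv_plan)
print(jct_parents, jct_plan)
```

Both queries return the same parents, but only the junction table gives the optimizer an index to work with. The exact wording of the plan text varies between SQLite versions.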


When designing a junction table, ask the following questions:


  1. Do I need to query in only one direction or both? [1]
    • If one direction, just create a composite PRIMARY KEY on both foreign keys (let's call them PARENT_ID and CHILD_ID). Order matters: if you query from parent to children, the PK should be {PARENT_ID, CHILD_ID}.
    • If both directions, also create a composite index in the opposite order, which is {CHILD_ID, PARENT_ID} in this case.
  2. Is the extra data small?
    • If yes, cluster the table and cover the extra data in the secondary index as necessary. [2]
    • If no, don't cluster the table and don't cover the extra data in the secondary index. [3]
  3. Do any other tables reference the junction table as their parent?
    • If yes, consider whether adding a surrogate key might be worthwhile to keep child FKs slim. But beware that if you add a surrogate key, this will probably eliminate the opportunity for clustering.



In many cases, the answers to these questions will be: both, yes and no, in which case your table will look similar to this (Oracle syntax below):

CREATE TABLE JUNCTION_TABLE (
    PARENT_ID INT,
    CHILD_ID INT,
    EXTRA_DATA VARCHAR2(50),
    PRIMARY KEY (PARENT_ID, CHILD_ID),
    FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
    FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
) ORGANIZATION INDEX COMPRESS;

CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
    JUNCTION_TABLE (CHILD_ID, PARENT_ID, EXTRA_DATA) COMPRESS;

Notes:

  • ORGANIZATION INDEX: Oracle-specific syntax for what most DBMSes call clustering. Other DBMSes have their own syntax, and some (MySQL/InnoDB) imply clustering and the user cannot turn it off.
  • COMPRESS: Some DBMSes support leading-edge index compression. Since a clustered table is essentially an index, compression can be applied to it as well.
  • JUNCTION_TABLE_IE1, EXTRA_DATA: Since the extra data is covered by the secondary index, the DBMS can get it without touching the table when querying in the direction from child to parents. The primary key acts as the clustering key, so the extra data is naturally covered when querying from a parent to its children.

Physically, you have just two B-Trees (one is the clustered table and the other is the secondary index) and no table heap at all. This translates to good querying performance (both parent-to-child and child-to-parent directions can be satisfied by a simple index range scan) and fairly small overhead when inserting/deleting rows.

Here is the equivalent MS SQL Server syntax (sans index compression):

CREATE TABLE JUNCTION_TABLE (
    PARENT_ID INT,
    CHILD_ID INT,
    EXTRA_DATA VARCHAR(50),
    PRIMARY KEY (PARENT_ID, CHILD_ID),
    FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
    FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
);

CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
    JUNCTION_TABLE (CHILD_ID, PARENT_ID) INCLUDE (EXTRA_DATA);
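The claim that both query directions can be answered from the two B-Trees alone can be checked on a small scale. The following is only a sketch under stated assumptions: it swaps in SQLite (via Python's sqlite3 module), whose WITHOUT ROWID tables are the closest analogue to a clustered table, and the sample rows are made up for illustration. Oracle and SQL Server will report their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent_table (parent_id INTEGER PRIMARY KEY);
    CREATE TABLE child_table  (child_id  INTEGER PRIMARY KEY);

    -- WITHOUT ROWID stores the rows inside the primary-key B-tree,
    -- SQLite's closest analogue to a clustered (index-organized) table.
    CREATE TABLE junction_table (
        parent_id  INTEGER NOT NULL REFERENCES parent_table (parent_id),
        child_id   INTEGER NOT NULL REFERENCES child_table (child_id),
        extra_data TEXT,
        PRIMARY KEY (parent_id, child_id)
    ) WITHOUT ROWID;

    -- Secondary index in the opposite order, covering extra_data.
    CREATE UNIQUE INDEX junction_table_ie1
        ON junction_table (child_id, parent_id, extra_data);

    -- Made-up sample rows.
    INSERT INTO parent_table VALUES (1), (2);
    INSERT INTO child_table  VALUES (10), (20);
    INSERT INTO junction_table VALUES (1, 10, 'a'), (1, 20, 'b'), (2, 20, 'c');
""")


def plan(sql, args=()):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, args).fetchall()
    return " | ".join(row[3] for row in rows)


# Parent-to-children: served by the clustered primary key.
p2c_sql = ("SELECT child_id, extra_data FROM junction_table "
           "WHERE parent_id = ? ORDER BY child_id")
# Child-to-parents: served by the covering secondary index.
c2p_sql = ("SELECT parent_id, extra_data FROM junction_table "
           "WHERE child_id = ? ORDER BY parent_id")

children_of_1 = conn.execute(p2c_sql, (1,)).fetchall()
parents_of_20 = conn.execute(c2p_sql, (20,)).fetchall()
p2c_plan = plan(p2c_sql, (1,))
c2p_plan = plan(c2p_sql, (20,))

print(children_of_1, p2c_plan)
print(parents_of_20, c2p_plan)
```

In this sketch neither plan should contain a table scan: the first query searches the primary-key B-tree and the second searches junction_table_ie1, which already holds extra_data, so no second lookup is needed.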

[1] In other words, do you only need to get the "children" of a given "parent", or might you also need to get the parents of a given child?

[2] Covering allows the query to be satisfied from the index alone, and avoids the expensive double lookup that would otherwise be necessary when accessing data through a secondary index in a clustered table.

[3] This way, the extra data is not repeated (which would be expensive, since it's big), yet you avoid the double lookup and replace it with a (cheaper) table heap access. But beware of the clustering factor, which can destroy the performance of range scans in heap-based tables!
