Normalizing an extremely big table


Problem Description


I face the following issue. I have an extremely big table. This table is a legacy from the people who previously worked on the project. The table is in MS SQL Server.

The table has the following properties:

  1. it has about 300 columns. All of them have "text" type, but some of them eventually should represent other types (for example, integer or datetime). So one has to convert these text values into the appropriate types before using them (a conversion sketch follows this list)
  2. the table has more than 100 million rows. The space for the table will soon reach 1 terabyte
  3. the table does not have any indices
  4. the table does not have any implemented mechanism of partitioning.
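
Since every column is stored as text, a validation pass is useful before converting. A minimal sketch, assuming SQL Server 2012 or later for TRY_CONVERT; the table and column names (dbo.BigTable, OrderDate) are hypothetical, not from the original post:

```sql
-- TRY_CONVERT returns NULL instead of raising an error when a text
-- value cannot be converted, which is safer on dirty legacy data.
SELECT TRY_CONVERT(datetime, OrderDate) AS OrderDateTyped
FROM dbo.BigTable;

-- Count the rows whose text does not parse as a datetime:
SELECT COUNT(*) AS BadDates
FROM dbo.BigTable
WHERE OrderDate IS NOT NULL
  AND TRY_CONVERT(datetime, OrderDate) IS NULL;
```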


As you may guess, it is impossible to run any reasonable query against this table. Now people only insert new records into the table, but nobody uses it. So I need to restructure it. I plan to create a new structure and refill it with the data from the old table. Obviously, I will implement partitioning, but it is not the only thing to be done.
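
A minimal sketch of what the partitioning could look like, assuming the rebuilt table gains a datetime column usable as the partition key; all object names here (pfByYear, psByYear, dbo.BigTableNew, EventDate) are illustrative assumptions:

```sql
-- Hypothetical partition-by-year scheme for the rebuilt table.
CREATE PARTITION FUNCTION pfByYear (datetime)
AS RANGE RIGHT FOR VALUES ('2012-01-01', '2013-01-01', '2014-01-01');

CREATE PARTITION SCHEME psByYear
AS PARTITION pfByYear ALL TO ([PRIMARY]);

-- The partitioning column must be part of the clustered key.
CREATE TABLE dbo.BigTableNew
(
    Id        bigint   IDENTITY(1,1) NOT NULL,
    EventDate datetime NOT NULL,
    -- ... the remaining, properly typed columns ...
    CONSTRAINT PK_BigTableNew PRIMARY KEY CLUSTERED (EventDate, Id)
) ON psByYear (EventDate);
```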


One of the most important features of the table is that the fields that are purely textual (i.e. they don't have to be converted into another type) usually have frequently repeated values, so the actual variety of values in a given column is in the range of 5-30 different values. This induces the idea of normalization: for every such textual column I will create an additional table with the list of all the different values that may appear in this column, create a (tinyint) primary key in this additional table, and then use an appropriate foreign key in the original table instead of keeping those text values there. Then I will put an index on this foreign key column. The number of columns to be processed this way is about 100.
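
A minimal sketch of this scheme for one textual column; the table and column names (dbo.ColorLookup, Color, dbo.BigTableNew) are hypothetical:

```sql
-- Lookup table for one of the ~100 textual columns.
-- tinyint covers 0-255, plenty for 5-30 distinct values.
CREATE TABLE dbo.ColorLookup
(
    ColorId tinyint      IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Color   varchar(100) NOT NULL UNIQUE
);

-- Populate the lookup from the distinct values actually present.
INSERT INTO dbo.ColorLookup (Color)
SELECT DISTINCT Color
FROM dbo.BigTable
WHERE Color IS NOT NULL;

-- In the new structure, the text column becomes a tinyint foreign key.
ALTER TABLE dbo.BigTableNew
    ADD ColorId tinyint NULL
        CONSTRAINT FK_BigTableNew_Color REFERENCES dbo.ColorLookup (ColorId);

-- Index the foreign key so filters and joins on it stay cheap.
CREATE INDEX IX_BigTableNew_ColorId ON dbo.BigTableNew (ColorId);
```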

This raises the following questions:


  1. Will this normalization really increase the speed of queries that impose conditions on some of these 100 fields? If we set aside the storage size needed for these columns, will there be any performance gain from replacing the initial text columns with tinyint columns? And if I do no normalization and simply put an index on those initial text columns, will the performance be the same as with the planned index on the tinyint columns?

  2. If I do the normalization described above and then build a view that shows the text values, I will need to join my main table with about 100 additional tables. A positive point is that those joins will be on primary key = foreign key pairs. But still, quite a large number of tables have to be joined. Here is the question: will the performance of queries against this view be worse than the performance of queries against the initial non-normalized table? Will the SQL Server optimizer really be able to optimize the query in a way that takes advantage of the normalization? (A sketch of such a view follows this list.)
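
For illustration, such a view might look like the following, shown with only two of the roughly 100 lookup joins; all names are hypothetical and carried over from the earlier sketches:

```sql
CREATE VIEW dbo.BigTableDenormalized
AS
SELECT  t.Id,
        t.EventDate,
        c.Color,
        s.Status
        -- ... one join per normalized column, about 100 in total ...
FROM dbo.BigTableNew       AS t
LEFT JOIN dbo.ColorLookup  AS c ON c.ColorId  = t.ColorId
LEFT JOIN dbo.StatusLookup AS s ON s.StatusId = t.StatusId;
```

Note that when the joins are on unique keys (or on trusted foreign keys), SQL Server can often eliminate the joins to lookup tables whose columns a particular query does not reference, so a query selecting only a few columns through the view does not necessarily pay for all 100 joins.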

Sorry for such a lengthy text.

Thank you for every comment!

PS
I created a related question about joining 100 tables:
Joining 100 tables

Answer


You'll find other benefits to normalizing the data besides the speed of queries running against it... such as size and maintainability, which alone should justify normalizing it...


However, it will also likely improve the speed of queries; currently, a single row containing 300 text columns is massive, and almost certainly exceeds the 8,060-byte limit for storing row data on a data page... the data is instead being stored in the ROW_OVERFLOW_DATA or LOB_DATA Allocation Units.
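
One way to verify where the rows actually live is to inspect the table's allocation units through the standard catalog views; a sketch (only the table name dbo.BigTable is a hypothetical stand-in):

```sql
-- Shows how many pages of the table sit in IN_ROW_DATA versus
-- ROW_OVERFLOW_DATA / LOB_DATA allocation units.
SELECT  au.type_desc,       -- IN_ROW_DATA, ROW_OVERFLOW_DATA or LOB_DATA
        SUM(au.total_pages) AS TotalPages
FROM sys.allocation_units AS au
JOIN sys.partitions AS p
  ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)
  OR (au.type = 2       AND au.container_id = p.partition_id)
WHERE p.object_id = OBJECT_ID('dbo.BigTable')
GROUP BY au.type_desc;
```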


By reducing the size of each row through normalization, such as replacing redundant text data with a TINYINT foreign key, and by moving columns that aren't dependent on this large table's primary key out into another table, the data should no longer overflow, and you'll also be able to store more rows per page.
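
The effect on row size and page density can be measured before and after the rebuild; a sketch, reusing the hypothetical table name from above:

```sql
-- Average row size and page fullness, per index of the table.
SELECT  index_id,
        avg_record_size_in_bytes,
        avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.BigTableNew'), NULL, NULL, 'DETAILED');
```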


As far as the overhead added by performing JOINs to get the normalized data... if you properly index your tables, this shouldn't add a substantial amount of overhead. However, if it does add unacceptable overhead, you can then selectively de-normalize the data as necessary.
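
As an illustration of the kind of query that benefits, assuming the hypothetical lookup table and foreign-key index from the sketches above:

```sql
-- The lookup narrows the predicate to a single tinyint value, and the
-- index IX_BigTableNew_ColorId supports the join back to the big table.
SELECT t.Id, t.EventDate
FROM dbo.BigTableNew AS t
JOIN dbo.ColorLookup AS c ON c.ColorId = t.ColorId
WHERE c.Color = 'red';
```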

