Narrowing the Data Types on a very large table


Question

I have a table that has over 150 Million records. Currently the types are thus:

id (Primary Key, Bigint)
idResult (Foreign key, Bigint null)
idItem (Foreign Key, Bigint null)
Number_1 (Bigint null)
Number_2 (Bigint null)
IsActive (Bigint null)

Number_1 and Number_2 can never be bigger than 10. IsActive is, plainly a Boolean. And none of these columns are nullable anywhere else in the code base. I'd also like to change the foreign key fields to int, but that is another story all together. This table was built years before I started, and we're experiencing some growing pains.
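
For reference, the shape being aimed at would look roughly like this (a sketch only; the table name is a stand-in, and whether the key columns change is left open, per the note above about the foreign keys being a separate story):

CREATE TABLE dbo.TABLENAME_narrowed
(
    id        BIGINT  NOT NULL PRIMARY KEY,  -- unchanged
    idResult  BIGINT  NULL,      -- FK; a later candidate for INT
    idItem    BIGINT  NULL,      -- FK; a later candidate for INT
    Number_1  TINYINT NOT NULL,  -- values never exceed 10
    Number_2  TINYINT NOT NULL,
    IsActive  BIT     NOT NULL   -- effectively a Boolean
);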

I'm looking for the best way to convert these columns (and several others on other tables, though this one is the main offender), and reclaim that disk space. I've tried a straight Alter Table, but that, somewhat expectedly, just didn't work. I can't remember the error specifically, but I believe it was related to the size of the table in question.

Right now I'm considering manually dropping and recreating the table, but I'm struggling to figure out a better way to do this, other than with a SELECT TOP 10000 * FROM dbo.TABLENAME WHERE id > 0, and simply incrementing the WHERE clause several times.

I've looked at SWITCH TO, but this requires that the data types of the target table match the source table, which is exactly what I'm trying to fix!

Any suggestions? Am I looking at this the wrong way?

Answer

First off, thank you for doing this. It is such an obvious win that many wouldn't see much value in, but it will be well worth it :). Making the world ever so slightly saner.

Regarding IsActive being a boolean. My guess is that you are thinking of making it a BIT field. That might be the way to go, but sometimes it is better to go with TINYINT as there is the possibility of expanding the meaning into more than 2 states. In which case it really becomes more of StatusID. Usually it is a case of something starting out simplistically as Active / Inactive, but later on maybe Deleted and/or others. From a sizing perspective, TINYINT is always 1 byte. On the other hand, BIT is 1 byte for up to 8 BIT fields. Meaning, one BIT field is 1 byte, 2 BIT fields is also one byte, and so on up to 8 BIT fields being stored in a single byte. So, there is no space savings choosing BIT over TINYINT when the table only has 1 BIT field. Just something to consider.
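
To make the sizing point concrete, a throwaway example (the table and column names are made up):

CREATE TABLE dbo.FlagSizingExample
(
    -- a single BIT column costs 1 byte; up to 8 BIT columns would share that same byte
    IsActive BIT     NOT NULL,
    -- TINYINT is also exactly 1 byte, but can represent up to 256 states (Active / Inactive / Deleted / ...)
    StatusID TINYINT NOT NULL
);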

Doing an ALTER TABLE is a bit much for a large table, as you saw. One option, though not a great one, is to add a NOT NULL field--Number_1new--with a DEFAULT value (this will be instantaneous due to the default, at least starting with SQL 2012) that none of them would naturally have (e.g. 255), and then slowly migrating the values, in a loop, as in:

UPDATE TOP (5000) tab
SET tab.Number_1new = tab.Number_1
FROM [table] tab
WHERE tab.Number_1new = 255;
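
Pulling the pieces together, a minimal sketch of that option, assuming TINYINT is the target type and 255 is the sentinel default (names, batch size, and delay are placeholders to adjust):

-- add the narrowed column; 255 is a value Number_1 can never legitimately hold
ALTER TABLE [table]
    ADD Number_1new TINYINT NOT NULL
        CONSTRAINT DF_table_Number_1new DEFAULT (255);

-- then migrate the real values in small batches
WHILE (1 = 1)
BEGIN
    UPDATE TOP (5000) tab
    SET    tab.Number_1new = tab.Number_1   -- assumes Number_1 is never actually NULL; use ISNULL() if it can be
    FROM   [table] tab
    WHERE  tab.Number_1new = 255;

    IF (@@ROWCOUNT = 0) BREAK;   -- every row has been migrated

    WAITFOR DELAY '00:00:01';    -- optional pause between batches
END;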

And when that is done then do:

EXEC sp_rename 'table.Number_1', 'Number_1old', 'COLUMN';
EXEC sp_rename 'table.Number_1new', 'Number_1', 'COLUMN';

Of course, best to wrap that in a TRANSACTION, and that wrapped in a TRY / CATCH. When the related code has been updated and everything has been tested and the data looks good, then you can drop the Number_1old column.
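
A sketch of that wrapper, using the same placeholder names as above:

BEGIN TRY
    BEGIN TRANSACTION;

    EXEC sp_rename 'table.Number_1',    'Number_1old', 'COLUMN';
    EXEC sp_rename 'table.Number_1new', 'Number_1',    'COLUMN';

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION;
    THROW;   -- re-raise so the failure is visible to the caller
END CATCH;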

However, the best way that I have found is to create a new table, slowly transition the data over, then swap the tables and code at the same time. I detailed the steps in an article on SQL Server Central: Restructure 100 Million Row (or more) Tables in Seconds. SRSLY! (free registration required). Just in case there are issues getting to that article, here are the basic steps:

  1. Create a new table with the ideal structure--[tableNew]. If you are on Enterprise Edition, consider enabling either ROW or PAGE compression as they can sometimes help. But please do some research first as there are some situations where they have a negative effect. There is documentation on MSDN to help you figure it out, as well as some tools to help estimate the potential savings. But even if you do enable compression, I wouldn't see that action as replacing the project you are doing here.
  2. Add a trigger AFTER UPDATE, DELETE on [table] to keep changes in sync (but no need to worry about new rows). A sketch of such a trigger is shown after this list.
  3. Create a SQL Agent Job that moves over missing rows in batches. Do this in a loop that does an INSERT INTO [tableNew] (Columns) SELECT TOP (n) Columns FROM [table] WHERE ?? ORDER BY ?? (see the batch-copy sketch after this list).
  4. The WHERE and ORDER BY clauses depend on the situation. They should be geared towards making the best use of the clustered index. If the clustered index of the new table is structurally the same as the old/current table, then at the start of each loop you can get the MAX([id]) from [tableNew] and use it to get WHERE table.[id] > @MaxIdInTableNew ORDER BY table.[id].
  5. Create the new table, the trigger on the current table, and the SQL Agent Job a week or so before you need to do the full cut-over. That time-frame might change based on your situation, but make sure to give yourself plenty of time. It is far better for the job to have finished migrating rows, with only a few trickling in at a time, than to be 100k rows shy of the full set when the release is supposed to begin.
  6. If the plan is to migrate the other related tables (the PKs referenced by the two FKs that you want to turn into INTs), then make those fields INT in this table now, and just don't add the FKs until those other tables have been migrated over to having INT fields as their PKs. You don't want to have to rebuild this table again just to make that change for the FK fields.
  7. During the cut-over (in a TRY / CATCH, of course; a sketch follows this list):
  1. Begin the transaction (BEGIN TRAN)
  2. Do a final row count on both tables to make sure everything has been moved over (you may want to sanity-check the rows before the release to make sure the trigger is handling updates and deletes as expected)
  3. Rename the current table to "old"
  4. Rename the "new" table so it no longer has the "new"
  5. Delete the SQL Agent Job (or at least disable it)
  6. Rename dependent objects such as constraints, etc.
  7. Commit
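
Step 2 calls for a sync trigger. A minimal sketch of what that could look like, assuming [table] and [tableNew] share the [id] key and the column list from the question (all object names are placeholders):

CREATE TRIGGER dbo.trg_table_SyncToNew
ON [table]
AFTER UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- push updated values into rows that have already been copied over
    UPDATE tn
    SET    tn.idResult = i.idResult,
           tn.idItem   = i.idItem,
           tn.Number_1 = i.Number_1,
           tn.Number_2 = i.Number_2,
           tn.IsActive = i.IsActive
    FROM   [tableNew] tn
    INNER JOIN inserted i
            ON i.[id] = tn.[id];

    -- remove rows that were deleted from the source
    -- (the NOT EXISTS check keeps updated rows from being deleted)
    DELETE tn
    FROM   [tableNew] tn
    INNER JOIN deleted d
            ON d.[id] = tn.[id]
    WHERE  NOT EXISTS (SELECT 1 FROM inserted i WHERE i.[id] = d.[id]);
END;

Because the trigger only fires AFTER UPDATE, DELETE, newly inserted rows are ignored here; the Agent job from step 3 picks them up.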
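
Steps 3 and 4 describe the Agent job's loop. A rough sketch of one approach, assuming both tables are clustered on [id] and using the column names from the question (batch size, delay, and names are assumptions to adjust):

DECLARE @MaxIdInTableNew BIGINT;

WHILE (1 = 1)
BEGIN
    -- how far has the copy progressed? 0 if [tableNew] is still empty
    SELECT @MaxIdInTableNew = ISNULL(MAX([id]), 0)
    FROM   [tableNew];

    -- if [id] is an IDENTITY column on the new table, SET IDENTITY_INSERT [tableNew] ON first;
    -- also assumes the source values fit the narrowed types and are not NULL
    INSERT INTO [tableNew] ([id], idResult, idItem, Number_1, Number_2, IsActive)
        SELECT TOP (10000)
               [id], idResult, idItem, Number_1, Number_2, IsActive
        FROM   [table]
        WHERE  [id] > @MaxIdInTableNew
        ORDER BY [id];

    IF (@@ROWCOUNT = 0) BREAK;  -- caught up; the trigger handles changes to already-copied rows

    WAITFOR DELAY '00:00:02';   -- small pause so other sessions get a turn
END;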
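
And a sketch of the cut-over in step 7, following the sub-steps above (object names, the job name, and the exact checks are placeholders; dependent objects such as constraints, indexes, and the trigger would be renamed or dropped in the same block):

BEGIN TRY
    BEGIN TRANSACTION;

    -- final row counts; abort if the two tables have drifted apart
    DECLARE @OldCount BIGINT = (SELECT COUNT_BIG(*) FROM [table]),
            @NewCount BIGINT = (SELECT COUNT_BIG(*) FROM [tableNew]);
    IF (@OldCount <> @NewCount)
        THROW 50000, 'Row counts do not match; aborting cut-over.', 1;

    EXEC sp_rename '[table]',    'tableOld';
    EXEC sp_rename '[tableNew]', 'table';

    -- the copy job is no longer needed; disable it (or remove it with msdb.dbo.sp_delete_job)
    EXEC msdb.dbo.sp_update_job @job_name = N'Copy rows to tableNew', @enabled = 0;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION;
    THROW;  -- re-raise the original error
END CATCH;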
