Updating 4 million records in SQL Server using a list of record IDs as input


Question

During a migration project, I'm faced with an update of 4 million records in our SQL Server database.

The update is very simple: a boolean field needs to be set to true/1, and the input I have is a list of all the ids for which this field must be set (one id per line).

I'm not exactly an expert when it comes to SQL tasks of this size, so I started out by trying a single UPDATE statement containing "WHERE xxx IN ( {list of ids, separated by commas} )". First, I tried this with a million records. On a small dataset on a test server this worked like a charm, but in the production environment it gave an error. So I shortened the list of ids a couple of times, but to no avail.

The next thing I tried was to turn each id in the list into its own UPDATE statement ("UPDATE yyy SET booleanfield = 1 WHERE id = '{id}'"). Somewhere I read that it's good to have a GO every x lines, so I inserted a GO every 100 lines (using the excellent 'sed' tool, ported from Unix).

So I split the list of 4 million UPDATE statements into parts of 250,000 each, saved them as .sql files, and started loading and running the first one in SQL Server Management Studio (2008). Note that I also tried SQLCMD.exe, but to my surprise it ran about 10-20 times slower than Management Studio.

It took about 1.5 hours to complete and ended with "Query completed with errors". The messages list, however, contained a long run of "1 row(s) affected" and "0 row(s) affected" lines, the latter for ids that were not found.

Next, I checked the number of updated records in the table using COUNT(*) and found a difference of a couple of thousand records between the number of UPDATE statements and the number of updated records.

I then thought that this might be due to the non-existent records, but when I subtracted the number of "0 row(s) affected" messages in the output, a mysterious gap of 895 records remained.

My questions:


  1. Is there any way to find a description and cause of the errors behind "Query completed with errors"?

  2. How could the mysterious gap of 895 records be explained?

  3. What is a better, or the best, way to do this update? (I'm starting to think that what I'm doing could be very inefficient and/or error-prone.)


Recommended answer

The best way to approach this task is to insert the 4 million ids into a table. In fact, you can put them into a table with an identity column by "bulk inserting" into a view.

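-- Staging table: the identity column gives each loaded id a sequence number for batching later
create table TheIds (rownum int identity(1,1), id int);
go

-- The view exposes only id, so a one-column input file can be loaded through it
create view v_TheIds as select id from TheIds;
go

bulk insert v_TheIds . . .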
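As a rough sketch of how the elided bulk insert might look, assuming the ids sit in a plain text file with one id per line: the file path below is hypothetical, and the row terminator is an assumption that depends on how the file was produced (a file with bare LF line endings would typically need '0x0a' instead).

```sql
-- Hypothetical path; replace with the actual location of the id list.
-- The file must be readable from the SQL Server machine itself.
bulk insert v_TheIds
from 'C:\data\ids.txt'
with (
    rowterminator = '\n',  -- assumption: Windows-style line endings
    tablock                -- take a table lock for a faster load
);

-- Sanity check: this should match the number of lines in the file.
select count(*) as loaded_ids from TheIds;
```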

With all the data in the database, you now have many more options. Try the update:

update t
    set booleanfield = 1
    where exists (select 1 from TheIds where TheIds.id = t.id)

You should also create an index on TheIds(id).
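A minimal sketch of that index, where the index name IX_TheIds_id is only an illustrative choice:

```sql
-- Lets the exists subquery look up each t.id without scanning all of TheIds.
-- The index name is hypothetical; any unused name will do.
create index IX_TheIds_id on TheIds (id);
```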

This is a large update, all executing as one transaction. That can hurt performance and start to fill the transaction log. You can break it into smaller transactions using the rownum column:

update t
    set booleanfield = 1
    where exists (select 1 from TheIds where TheIds.id = t.id and TheIds.rownum < 1000)

The exists clause here does the work that a join would otherwise do. It acts as a semi-join, so only rows of t with a matching id in TheIds are updated. The major difference is portability: this correlated subquery syntax should work in other databases, whereas join syntax in an update statement is database-specific.
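For comparison, the SQL Server-specific join form of the same update might look like the sketch below. It uses the same placeholder table name t as the statements above and is not part of the original answer.

```sql
-- T-SQL only: UPDATE ... FROM with a join is not portable to other databases.
-- A row of t that matches several rows in TheIds is still updated only once.
update t
    set booleanfield = 1
    from t
    join TheIds on TheIds.id = t.id;
```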

With the rownum column, you can select as many rows as you want for each update. So, if the overall update is too big, you can put the update in a loop:

where rownum < 100000
where rownum between 100000 and 199999
where rownum between 200000 and 299999

and so on. You don't have to do this, but you can if you want to batch the updates for some reason.
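One way such a loop could be written in T-SQL is sketched below. The batch size of 100000 and the variable names are assumptions for illustration, and t again stands for the real table name. Outside an explicit transaction, each update statement commits on its own, so every batch is a separate transaction.

```sql
declare @batch int = 100000;  -- assumed batch size; tune to taste
declare @lo    int = 1;       -- first rownum of the current batch
declare @max   int = (select max(rownum) from TheIds);

-- If TheIds is empty, @max is null and the loop body never runs.
while @lo <= @max
begin
    update t
        set booleanfield = 1
        where exists (select 1
                      from TheIds
                      where TheIds.id = t.id
                        and TheIds.rownum between @lo and @lo + @batch - 1);

    set @lo = @lo + @batch;   -- move on to the next rownum range
end;
```

An index on TheIds(rownum) would probably help each batch find its range of ids quickly, though that is a guess rather than something the answer states.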

The key idea is to get the list of ids into a table in the database, so you can use the power of the database for the subsequent operations.
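As one example of such a follow-up operation, the staged list can be checked against the target table. The queries below are a sketch, not part of the original answer, and they only show what could be inspected; they make no claim about what actually caused the gap in the question.

```sql
-- Ids in the input list that have no matching row in t.
select count(*) as ids_not_found
from TheIds
where not exists (select 1 from t where t.id = TheIds.id);

-- Ids that appear more than once in the input list.
select id, count(*) as occurrences
from TheIds
group by id
having count(*) > 1;
```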
