更新 SQL Server 表中除一条重复记录之外的所有记录 [英] Update all but one of duplicate records in table in SQL Server
问题描述
我有一个 SQL Server 表,其中一列 (object_id
) 中有重复的条目,例如:
I have a SQL Server table that has duplicate entries in one of the columns (object_id
) e.g.:
+----+-----------+------------+
| id | object_id | status_val |
+----+-----------+------------+
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 4 | 2 | 0 |
| 5 | 3 | 0 |
| 6 | 3 | 0 |
+----+-----------+------------+
当 object_id
列中有重复时,我需要更新它们的所有状态,除了一个.所以在上表 object_id
1 和 3 是重复的.所以我想将他们的 status_val
更改为 2,但其中一个条目除外.结果如下:
I need to update all of their statuses except for one when there is duplication in the object_id
column. So in the table above object_id
1 and 3 are duplicated. So I would want to change their status_val
to 2, except one of the entries. The result would look like:
| id | object_id | status_val |
+----+-----------+------------+
| 1 | 1 | 0 |
| 2 | 1 | 2 |
| 3 | 1 | 2 |
| 4 | 2 | 0 |
| 5 | 3 | 0 |
| 6 | 3 | 2 |
+----+-----------+------------+
复制哪一行的状态更新并不重要.
It doesn't matter which one of the duplicated rows has it's status updated.
任何帮助将不胜感激.
推荐答案
你可以在没有 join 的情况下解决这个问题,这意味着它应该有更好的性能.这个想法是按您的 object_id 对数据进行分组,计算每个 object_id 的行号.这就是分区"的作用.然后您可以更新 row_num > 1 的位置.这将更新除第一个之外的所有重复 object_id!
You may solve this problem without a join, which means it should have better performance. The idea is to group the data by your object_id, counting the row number of each object_id. This is what "partition by" does. Then you can update where the row_num is > 1. This will update all duplicated object_id except the first one!
update t set t.status_val = 'some_status'
from (
select *, row_number() over(partition by object_id order by (select null)) row_num
from foo
) t
where row_num > 1
在 82944 条记录的测试表上,性能是这样的(您的里程可能会有所不同!):表'测试'.扫描计数 5,逻辑读取 82283,物理读取 0,预读读取 0,lob 逻辑读取 0,lob 物理读取 0,lob 预读读取 0.CPU 时间 = 141 毫秒,经过时间 = 150 毫秒.
On a test table of 82944 records, the performance was such (your mileage may vary!): Table 'test'. Scan count 5, logical reads 82283, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. CPU time = 141 ms, elapsed time = 150 ms.
我们当然也可以通过使用内连接来解决这个问题,但是,通常这会导致更多的逻辑读取和更高的 CPU:
We can certainly also solve this problem by using an inner join, however, in general this should lead to more logical reads and higher CPU:
表'测试'.扫描计数10,逻辑读83622,物理读0,预读0,lob逻辑读0,lob物理读0,lob预读0.表工作文件".扫描计数 0,逻辑读 0,物理读 0,预读 0,lob 逻辑读 0,lob 物理读 0,lob 预读 0.表工作台".扫描计数4,逻辑读167426,物理读0,预读0,lob逻辑读0,lob物理读0,lob预读0.CPU 时间 = 342 毫秒,经过时间 = 233 毫秒.
Table 'test'. Scan count 10, logical reads 83622, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 4, logical reads 167426, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. CPU time = 342 ms, elapsed time = 233 ms.
要循环结果并小批量更新:
To loop over the results and update in smaller batches:
declare @rowcount int = 1;
declare @batch_size int = 1000;
while @rowcount > 0
begin
update top(@batch_size) t set t.status_val = 'already updated'
from (
select *, row_number() over(partition by object_id order by (select null)) row_num
from foo
where status_val <> 'already updated'
) t
where row_num > 1
set @rowcount = @@rowcount;
end
如果其他并发会话试图访问此表,这将有助于保持锁定.
This will help keep locking down if other concurrent sessions are trying to access this table.
这篇关于更新 SQL Server 表中除一条重复记录之外的所有记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!