How to delete duplicate rows without a unique identifier
Problem description
I have duplicate rows in my table and I want to delete duplicates in the most efficient way since the table is big. After some research, I have come up with this query:
WITH TempEmp AS
(
SELECT name, ROW_NUMBER() OVER(PARTITION by name, address, zipcode ORDER BY name) AS duplicateRecCount
FROM mytable
)
-- Now Delete Duplicate Records
DELETE FROM TempEmp
WHERE duplicateRecCount > 1;
But it only works in SQL, not in Netezza. It would seem that it does not like the DELETE after the WITH clause?
Recommended answer
I like @erwin-brandstetter's solution, but wanted to show a solution with the USING keyword:
DELETE FROM table_with_dups T1
USING table_with_dups T2
WHERE T1.ctid < T2.ctid -- delete the "older" ones
AND T1.name = T2.name -- list columns that define duplicates
AND T1.address = T2.address
AND T1.zipcode = T2.zipcode;
If you want to review the records before deleting them, simply replace DELETE with SELECT * and USING with a comma, i.e.
SELECT * FROM table_with_dups T1
, table_with_dups T2
WHERE T1.ctid < T2.ctid -- select the "older" ones
AND T1.name = T2.name -- list columns that define duplicates
AND T1.address = T2.address
AND T1.zipcode = T2.zipcode;
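Note that with three or more copies of a row, the SELECT above can list the same row more than once, because it pairs each row with every newer copy. Another way to check the result is to run the delete inside a transaction and look at what it removed. The following is only a sketch and assumes PostgreSQL (which the ctid column already implies); RETURNING is a PostgreSQL extension and may not be available in other databases.

```sql
BEGIN;

DELETE FROM table_with_dups T1
USING table_with_dups T2
WHERE T1.ctid < T2.ctid        -- delete the "older" ones
  AND T1.name = T2.name        -- list columns that define duplicates
  AND T1.address = T2.address
  AND T1.zipcode = T2.zipcode
RETURNING T1.*;                -- output the rows that were deleted

-- Inspect the output, then undo the delete (use COMMIT instead to keep it)
ROLLBACK;
```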
Update: I tested some of the different solutions here for speed. If you don't expect many duplicates, then this solution performs much better than the ones that have a NOT IN (...) clause, as those generate a lot of rows in the subquery.
If you rewrite the query to use IN (...), then it performs similarly to the solution presented here, but the SQL code becomes much less concise.
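For comparison, one possible shape of such an IN (...) rewrite is sketched below. It is an illustration rather than the exact query that was timed, and it assumes PostgreSQL and the same table and columns as above. The subquery lists only the rows to delete, so it stays small when duplicates are rare.

```sql
DELETE FROM table_with_dups
WHERE ctid IN (
    SELECT ctid
    FROM (
        SELECT ctid,
               -- number the copies within each duplicate group, newest ctid first
               ROW_NUMBER() OVER (
                   PARTITION BY name, address, zipcode
                   ORDER BY ctid DESC
               ) AS rn
        FROM table_with_dups
    ) ranked
    WHERE rn > 1   -- everything except the copy with the highest ctid
);
```

Like the USING version, this keeps the copy with the highest ctid in each group.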
Update 2: If you have NULL values in one of the key columns (which you really shouldn't, IMO), then you can use COALESCE() in the condition for that column, e.g.
AND COALESCE(T1.col_with_nulls, '[NULL]') = COALESCE(T2.col_with_nulls, '[NULL]')
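If the database is PostgreSQL, an alternative to the COALESCE() sentinel is the IS NOT DISTINCT FROM predicate, which treats two NULLs as equal. The sketch below assumes the same table as above plus the illustrative column col_with_nulls; it avoids the case where a real value happens to equal the '[NULL]' placeholder.

```sql
DELETE FROM table_with_dups T1
USING table_with_dups T2
WHERE T1.ctid < T2.ctid        -- delete the "older" ones
  AND T1.name = T2.name
  AND T1.address = T2.address
  AND T1.zipcode = T2.zipcode
  -- true when both values are equal or both are NULL
  AND T1.col_with_nulls IS NOT DISTINCT FROM T2.col_with_nulls;
```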