How to delete duplicate rows without a unique identifier

Question

I have duplicate rows in my table and I want to delete duplicates in the most efficient way since the table is big. After some research, I have come up with this query:

WITH TempEmp AS
(
    SELECT name,
           ROW_NUMBER() OVER (PARTITION BY name, address, zipcode ORDER BY name) AS duplicateRecCount
    FROM mytable
)
-- Now delete duplicate records
DELETE FROM TempEmp
WHERE duplicateRecCount > 1;

But it only works in SQL, not in Netezza. It would seem that it does not like the DELETE after the WITH clause?

Recommended answer

I like @erwin-brandstetter's solution, but wanted to show a solution with the USING keyword:

DELETE   FROM table_with_dups T1
  USING       table_with_dups T2
WHERE  T1.ctid    < T2.ctid       -- delete the "older" ones
  AND  T1.name    = T2.name       -- list columns that define duplicates
  AND  T1.address = T2.address
  AND  T1.zipcode = T2.zipcode;
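
To try the statement safely before running it on real data, a minimal sketch on a throwaway table can help. It assumes PostgreSQL (where ctid is available); the table name dup_demo and the sample rows are hypothetical, and the transaction is rolled back at the end so nothing persists.

```sql
-- Hypothetical scratch table; PostgreSQL assumed because the answer relies on ctid.
BEGIN;

CREATE TEMP TABLE dup_demo (name text, address text, zipcode text);

INSERT INTO dup_demo (name, address, zipcode) VALUES
  ('Ann', '1 Main St', '11111'),
  ('Ann', '1 Main St', '11111'),
  ('Ann', '1 Main St', '11111'),
  ('Bob', '2 Oak Ave', '22222');

-- Same pattern as the answer: a row is deleted when another row with the
-- same key columns has a larger ctid, so one row per group survives.
DELETE FROM dup_demo T1
  USING     dup_demo T2
WHERE  T1.ctid    < T2.ctid
  AND  T1.name    = T2.name
  AND  T1.address = T2.address
  AND  T1.zipcode = T2.zipcode;

-- Inspect what is left, then discard the experiment.
SELECT name, address, zipcode FROM dup_demo ORDER BY name;

ROLLBACK;
```

With these sample rows, the DELETE should remove two of the three 'Ann' rows and leave the single 'Bob' row untouched.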

If you want to review the records before deleting them, replace DELETE with SELECT * and USING with a comma (,), i.e.:

SELECT * FROM table_with_dups T1
  ,           table_with_dups T2
WHERE  T1.ctid    < T2.ctid       -- select the "older" ones
  AND  T1.name    = T2.name       -- list columns that define duplicates
  AND  T1.address = T2.address
  AND  T1.zipcode = T2.zipcode;
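
The self-join preview repeats a row once for every newer duplicate it matches, which can get noisy for keys with many copies. As a complementary check, a grouped count shows how many copies each key has. This is a sketch in standard SQL that reuses the table and column names from the answer; the copies alias is only illustrative.

```sql
-- Assumes the table_with_dups table and key columns used in the answer.
SELECT name, address, zipcode, COUNT(*) AS copies
FROM   table_with_dups
GROUP  BY name, address, zipcode
HAVING COUNT(*) > 1          -- only keys that occur more than once
ORDER  BY copies DESC;
```

For each key listed, the DELETE would remove copies - 1 rows and keep one.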

Update: I tested some of the different solutions here for speed. If you don't expect many duplicates, then this solution performs much better than the ones that have a NOT IN (...) clause as those generate a lot of rows in the subquery.

If you rewrite the query to use IN (...) then it performs similarly to the solution presented here, but the SQL code becomes much less concise.
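
For comparison, here is one way such an IN (...) rewrite might look. This is a sketch, not the exact query that was benchmarked: it assumes PostgreSQL (for ctid and window functions) and borrows ROW_NUMBER() from the question to rank the rows within each duplicate group. The ranked alias and the rn column are illustrative names.

```sql
-- Sketch only; assumes PostgreSQL and the table_with_dups columns from the answer.
DELETE FROM table_with_dups
WHERE  ctid IN (
         SELECT ctid
         FROM  (SELECT ctid,
                       ROW_NUMBER() OVER (PARTITION BY name, address, zipcode
                                          ORDER BY ctid DESC) AS rn
                FROM   table_with_dups) ranked
         WHERE  rn > 1     -- rn = 1 is the row with the largest ctid; keep it
       );
```

Like the USING version, this keeps the row with the largest ctid in each group. One behavioral difference: PARTITION BY puts NULL keys into the same group, whereas the = comparisons in the USING version never match NULLs.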

Update 2: If you have NULL values in one of the key columns (which you really shouldn't IMO), then you can use COALESCE() in the condition for that column, e.g.

  AND COALESCE(T1.col_with_nulls, '[NULL]') = COALESCE(T2.col_with_nulls, '[NULL]')
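
A sentinel such as '[NULL]' has two limits: a real value of '[NULL]' would be treated as equal to NULL, and the sentinel has to be compatible with the column's type. Where the database supports it (PostgreSQL does; for other systems this is an assumption to verify), IS NOT DISTINCT FROM compares two values while treating NULLs as equal, with no sentinel needed. A sketch of the full statement, using col_with_nulls as the same placeholder column name:

```sql
-- Assumes PostgreSQL; col_with_nulls is the placeholder column name from the answer.
DELETE FROM table_with_dups T1
  USING     table_with_dups T2
WHERE  T1.ctid    < T2.ctid
  AND  T1.name    = T2.name
  AND  T1.address = T2.address
  AND  T1.zipcode = T2.zipcode
  AND  T1.col_with_nulls IS NOT DISTINCT FROM T2.col_with_nulls;  -- NULL matches NULL
```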
