检查大表中的完整重复行 [英] Check for complete duplicate rows in a large table

查看:140
本文介绍了检查大表中的完整重复行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的原始问题与所有相关的内容可以在这里找到:

将多列主键添加到具有4000万条记录的表

My original question with all the relevant context can be found here:
Adding a multi-column primary key to a table with 40 million records

我有一个4000万行的表,没有主键。在添加主键之前,我想检查表是否有重复的条目。当我说重复条目时,我不仅仅是指特定列上的重复。我的意思是重复整个行。

I have a table with 40 million rows and no primary key. Before I add the primary key, I would like to check if the table has any duplicate entries. When I say duplicate entries, I don't just mean duplicate on particular columns. I mean duplicates on entire rows.

我在最后一个问题中被告知我可以执行 EXISTS 查询确定重复。我该怎么做?

I was told in my last question that I can do an EXISTS query to determine duplicates. How would I do that?

我正在运行PostgreSQL 8.1.22。 (通过运行选择版本()获得此信息)。

I am running PostgreSQL 8.1.22. (Got this info by running select version()).

推荐答案

要查找是否存在 任何 列),这可能是最快的方式:

To find whether any full duplicate exists (identical on all columns), this is probably the fastest way:

SELECT EXISTS (
    SELECT 1
    FROM   tbl t
    NATURAL JOIN tbl t1 
    WHERE  t.ctid <> t1.ctid
    )

NATURAL JOIN 是一个非常方便的简写,因为(引用手册):

NATURAL JOIN is a very convenient shorthand for the case because (quoting the manual here):


NATURAL USING 列表的缩写,该列表提及
两个表中所有列的列

NATURAL is shorthand for a USING list that mentions all columns in the two tables that have the same names.

EXISTS 可能是最快的,becau一旦找到第一个副本,Post Post会立即停止搜索。因为你很可能没有一个覆盖整行的索引,而且你的表格很大,这样可以节省你很多时间。

EXISTS is probably fastest, because Postgres stops searching as soon as the first duplicate is found. Since you most probably don't have an index covering the whole row and your table is huge, this will save you a lot of time.

请注意, NULL 从不认为与另一个 NULL 。如果您有 NULL 值,并认为它们相同,则必须执行更多操作。

Be aware that NULL is never considered identical to another NULL. If you have NULL values and consider them identical, you'd have to do more.

ctid 是一个系统列可以(ab-)用作临时主键,但长期来看不能替换实际的用户定义的主键。

ctid is a system column that can be (ab-)used as ad-hoc primary key, but cannot replace an actual user-defined primary key in the long run.

过期版本8.1似乎没有为 ctid定义的<> 运算符。尝试转换为文本

The outdated version 8.1 seems to have no <> operator defined for a ctid. Try casting to text:

SELECT EXISTS (
    SELECT 1
    FROM   tbl t
    NATURAL JOIN tbl t1 
    WHERE  t.ctid::text <> t1.ctid::text
    )

这篇关于检查大表中的完整重复行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆