检查大表中的完整重复行 [英] Check for complete duplicate rows in a large table
问题描述
我的原始问题与所有相关的内容可以在这里找到:
将多列主键添加到具有4000万条记录的表
My original question with all the relevant context can be found here:
Adding a multi-column primary key to a table with 40 million records
我有一个4000万行的表,没有主键。在添加主键之前,我想检查表是否有重复的条目。当我说重复条目时,我不仅仅是指特定列上的重复。我的意思是重复整个行。
I have a table with 40 million rows and no primary key. Before I add the primary key, I would like to check if the table has any duplicate entries. When I say duplicate entries, I don't just mean duplicate on particular columns. I mean duplicates on entire rows.
我在最后一个问题中被告知我可以执行 EXISTS
查询确定重复。我该怎么做?
I was told in my last question that I can do an EXISTS
query to determine duplicates. How would I do that?
我正在运行PostgreSQL 8.1.22。 (通过运行选择版本()
获得此信息)。
I am running PostgreSQL 8.1.22. (Got this info by running select version()
).
推荐答案
要查找是否存在 任何 列),这可能是最快的方式:
To find whether any full duplicate exists (identical on all columns), this is probably the fastest way:
SELECT EXISTS (
SELECT 1
FROM tbl t
NATURAL JOIN tbl t1
WHERE t.ctid <> t1.ctid
)
NATURAL JOIN
是一个非常方便的简写,因为(引用手册):
NATURAL JOIN
is a very convenient shorthand for the case because (quoting the manual here):
NATURAL
是USING
列表的缩写,该列表提及
两个表中所有列的列
NATURAL
is shorthand for aUSING
list that mentions all columns in the two tables that have the same names.
EXISTS
可能是最快的,becau一旦找到第一个副本,Post Post会立即停止搜索。因为你很可能没有一个覆盖整行的索引,而且你的表格很大,这样可以节省你很多时间。
EXISTS
is probably fastest, because Postgres stops searching as soon as the first duplicate is found. Since you most probably don't have an index covering the whole row and your table is huge, this will save you a lot of time.
请注意, NULL
是从不认为与另一个 NULL $ c $相同C>。如果您有
NULL
值,并认为它们相同,则必须执行更多操作。
Be aware that NULL
is never considered identical to another NULL
. If you have NULL
values and consider them identical, you'd have to do more.
ctid
是一个系统列可以(ab-)用作临时主键,但长期来看不能替换实际的用户定义的主键。
ctid
is a system column that can be (ab-)used as ad-hoc primary key, but cannot replace an actual user-defined primary key in the long run.
过期版本8.1似乎没有为 ctid定义的
。尝试转换为<>
运算符文本
:
The outdated version 8.1 seems to have no <>
operator defined for a ctid
. Try casting to text
:
SELECT EXISTS (
SELECT 1
FROM tbl t
NATURAL JOIN tbl t1
WHERE t.ctid::text <> t1.ctid::text
)
这篇关于检查大表中的完整重复行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!