从数百万条记录中删除重复行的有效方法 [英] Effective way to delete duplicate rows from millions of records

查看:243
本文介绍了从数百万条记录中删除重复行的有效方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找一种从数据库中删除重复记录的有效方法.首先,我使用了一个使用联接等的存储过程,这导致查询执行非常慢.现在,我正在尝试另一种方法.请考虑以下查询:

I am looking to find an effective way to delete duplicated records from my database. First, I used a stored procedure that uses joins and such, which caused the query to execute very slow. Now, I am trying a different approach. Please consider the following queries:

/* QUERY A */

SELECT *
FROM my_table
WHERE col1 = value
  AND col2 = value
  AND col3 = value

此查询仅在12秒内执行,结果为182.400条记录.该表中的行数当前为420.930.407,并且对col1和col3进行了索引.

This query just executed in 12 seconds, with a result of 182.400 records. The row count in the table is currently 420.930.407, and col1 and col3 are indexed.

下一个查询:

/* QUERY B */

WITH ALL_RECORDS AS
  (SELECT id
   FROM my_table
   WHERE col1 = value
     AND col2 = value
     AND col3 = value)
SELECT *
FROM ALL_RECORDS

此查询用了不到2秒的时间,并且为我提供了表中182.400条记录的所有ID(根据where子句).

This query took less than 2 seconds, and gives me all the id's of the 182.400 records in the table (according to the where clause).

然后,我的最后一个查询是一个查询,该查询选择在要分组以检查重复项的列上分组的所有记录的最低(第一个)id:

Then, my last query, is a query that selects the lowest (first) id of all records grouped on the columns I want to group on to check for duplicates:

/* QUERY C */

SELECT MIN(id)
FROM my_table
WHERE col1 = value
  AND col2 = value
  AND col3 = value
GROUP BY col1,
         col2,
         col3,
         col4,
         col5,
         col6

同样,此查询将在不到2秒的时间内执行.结果为30.400,这意味着在182.400个唯一记录中有30.400个唯一记录.

Again, this query executes in less than 2 seconds. The result is 30.400, which means there are 30.400 unique records among the 182.400 records that are unique.

现在,我想删除(或首先选择以确保我拥有正确的查询权限)所有非唯一记录.因此,我想从my_table中删除182.400-30.400 = 152.000条记录.

Now, I'd like to delete (or, first select to make sure I have my query right) all records that are not unique. So, I'd like to remove 182.400 - 30.400 = 152.000 records from my_table.

我以为我要结合最后两个查询:根据col1,col2和col3上的where子句获取属于我的数据集的所有id(查询B),然后从该数据集中删除/选择所有记录该ID不在唯一记录ID(查询C)的ID列表中.

I thought I'd combine the two last queries: get all id's that belong to my dataset according to the where clause on col1, col2 and col3 (query B), and then delete/select all records from that dataset of which the id is not in the id list of the unique record id's (query C).

但是,当我从查询B中选择所有查询B,其中查询B.id不是查询C时,该查询不会花2、4或12(14或16)秒,但是却要花很多时间(10.000后显示20.000条记录)分钟,大约2分钟后大约40.000,因此我取消了查询,因为它将找到152.000条记录,这种记录将花费8分钟).

However, when I select all from query B where query B.id NOT IN query C, the query does not take 2, 4 or 12 (14 or 16) seconds, but seems to take forever (20.000 records shown after 1 minute, around 40.000 after 2 minutes, so I canceled the query since it'll find 152.000 records, which will take 8 minutes this way).

WITH ALL_RECORDS AS
  (SELECT id
   FROM my_table
   WHERE col1 = value
     AND col2 = value
     AND col3 = value)
SELECT id
FROM ALL_RECORDS
WHERE id NOT IN
    (SELECT MIN(id)
     FROM my_table
     WHERE col1 = value
       AND col2 = value
       AND col3 = value
     GROUP BY col1,
              col2,
              col3,
              col4,
              col5,
              col6)

我知道NOT IN很慢,但是我无法理解它的速度如何(因为两个没有不完全执行的查询都在不到2秒的时间内执行了.)

I know NOT IN is slow, but I can't grasp how it's THIS slow (since both queries without the not in part execute in less than 2 seconds each).

在解决这个难题方面,有人对我有很好的建议吗?

Does anyone have some good advice for me on how to solve this puzzle?

------------------其他信息------------------

------------------ Additional information ------------------

先前的解决方案是以下存储过程.由于某种原因,它可以在我的验收环境中完美执行,但不能在我的生产环境中执行.目前,我们的生产记录超过4亿条,接受记录超过200万条,这可能是一个原因.

Previous solution was the following stored procedure. For some reason it executes perfectly on my acceptance environment, but not on my production environment. Currently, we have over 400 million records on production and a little over 2 million records on acceptance, so this might be a reason.

DELETE my_table
FROM my_table
LEFT OUTER JOIN
  (SELECT MIN(id) AS RowId,
          col1,
          col2,
          col3,
          col4,
          col5,
          col6
   FROM my_table
   WHERE col1 = value
     AND col2 = value
     AND col3 = value
   GROUP BY col1,
            col2,
            col3,
            col4,
            col5,
            col6) AS KeepRows ON my_table.id = KeepRows.RowId
WHERE KeepRows.RowId IS NULL
  AND my_table.col1 = value
  AND my_table.col2 = value
  AND my_table.col3 = value

我已经基于stackoverflow的另一个答案(目前找不到它)来解决这个问题,但是我觉得我应该能够基于查询B和C创建一个在几秒钟内执行的查询...

I have based this solution on another answer on stackoverflow (can't find it at the moment), but I feel I should be able to create a query based on Query B and C that executes within a few seconds...

推荐答案

with dupl as (
select row_number() over(partition by col1,col2,col3,col4,col5,col6 order by id) rn,
id,col1,col2,col3,col4,col5,col6
from myTable
)
delete dupl where rn>1

这篇关于从数百万条记录中删除重复行的有效方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆