Delete duplicates from large dataset (>100 Mio rows)


Problem description

I know that this topic came up many times before here, but none of the suggested solutions worked for my dataset because my laptop stopped calculating due to memory issues or ran out of storage.

My table looks like the following and has 108 Mio rows:

Col1        | Col2 | Col3            | Col4 | SICComb  | NameComb
Case New    | 3523 | Alexander       | 6799 | 67993523 | AlexanderCase New
Case New    | 3523 | Undisclosed     | 6799 | 67993523 | Case NewUndisclosed
Undisclosed | 6799 | Case New        | 3523 | 67993523 | Case NewUndisclosed
Case New    | 3523 | Undisclosed     | 6799 | 67993523 | Case NewUndisclosed
SmartCard   | 3674 | NEC             | 7373 | 73733674 | NECSmartCard
SmartCard   | 3674 | Virtual NetComm | 7373 | 73733674 | SmartCardVirtual NetComm
SmartCard   | 3674 | NEC             | 7373 | 73733674 | NECSmartCard

The unique columns are SICComb and NameComb. I tried to add a primary key with:

ALTER TABLE dbo.test ADD ID INT IDENTITY(1,1)

but the integers are filling up more than 30 GB of my storage in just a few minutes.

Which would be the fastest and most efficient method to delete the duplicates from the table?

Recommended answer

In general, the fastest way to delete duplicates from a table is to insert the records -- without duplicates -- into a temporary table, truncate the original table and insert them back in.

Here is the idea, using SQL Server syntax:

-- 1. Copy the distinct rows into a temporary table
select distinct t.*
into #temptable
from t;

-- 2. Empty the original table
truncate table t;

-- 3. Reload the de-duplicated rows
insert into t
    select *
    from #temptable;
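
Applied to the table from the question, a minimal sketch could look as follows; it assumes the table is dbo.test with exactly the six columns shown above and no identity column yet, and #dedup is just a placeholder temp-table name:

-- copy the distinct rows aside
select distinct Col1, Col2, Col3, Col4, SICComb, NameComb
into #dedup
from dbo.test;

-- empty the original table, then reload it
truncate table dbo.test;

insert into dbo.test (Col1, Col2, Col3, Col4, SICComb, NameComb)
    select Col1, Col2, Col3, Col4, SICComb, NameComb
    from #dedup;

drop table #dedup;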

Of course, this depends to a large extent on how fast the first step is. And, you need to have the space to store two copies of the same table.

Note that the syntax for creating the temporary table differs among databases. Some use create table as syntax rather than select into.
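
For example, in databases that follow the create table as pattern (PostgreSQL or MySQL, for instance), the first step would look roughly like this instead of select into:

-- create and fill the temporary table in one statement
create temporary table temptable as
select distinct *
from t;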

Edit:

Your identity insert error is troublesome. I think you need to remove the identity from the list of columns for the distinct. Or do:

select min(<identity col>), <all other columns>
from t
group by <all other columns>
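
A sketch of that step with the question's table, assuming the identity column is the ID added by the ALTER TABLE above and the remaining columns are those from the sample data:

-- keep the smallest ID per group of otherwise-identical rows
select min(ID) as ID,
       Col1, Col2, Col3, Col4, SICComb, NameComb
into #temptable
from dbo.test
group by Col1, Col2, Col3, Col4, SICComb, NameComb;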

If you have an identity column, then there are no duplicates (by definition).

In the end, you will need to decide which id you want for the rows. If you can generate a new id for the rows, then just leave the identity column out of the column list for the insert:

insert into t(<all other columns>)
    select <all other columns>
    from #temptable;
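
With the question's columns that might look like the following, again assuming the de-duplicated rows sit in #temptable; ID is omitted, so the identity property generates fresh values:

insert into dbo.test (Col1, Col2, Col3, Col4, SICComb, NameComb)
    select Col1, Col2, Col3, Col4, SICComb, NameComb
    from #temptable;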

If you need the old identity value (and the minimum will do), allow explicit values into the identity column (in SQL Server, set identity_insert ... on) and do:

insert into t(<all columns including identity>)
    select <all columns including identity>
    from #temptable;
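
In SQL Server, that means wrapping the insert in set identity_insert; a sketch reusing the table and column names assumed above:

set identity_insert dbo.test on;

insert into dbo.test (ID, Col1, Col2, Col3, Col4, SICComb, NameComb)
    select ID, Col1, Col2, Col3, Col4, SICComb, NameComb
    from #temptable;

set identity_insert dbo.test off;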
