SQL查询查找完全相同和近似重复 [英] SQL Query Find Exact and Near Dupes

查看:221
本文介绍了SQL查询查找完全相同和近似重复的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含名字,姓氏,Add1和其他字段的SQL表.我正在努力清理此数据.有一些可能存在欺骗的情况-

I have a SQL table with FirstName, LastName, Add1 and other fields. I am working to get this data cleaned up. There are a few instances of likely dupes -

  1. 所有3列对于1条以上的记录都是完全相同的
  2. 第一个和最后一个相同,只有一个具有地址,另一个为空
  3. 第一个和最后一个相似(John | Doe vs John C. | Doe),地址相同或一个为空

我想生成一个查询,可以提供给用户,以便他们可以检出这些记录,比较他们的相关记录,然后删除他们不需要的记录.

I'm wanting to generate a query I can provide to the users, so they can check these records out, compare their related records and then delete the one they don't need.

我一直在研究相似性函数,soundex等,但是看起来都很复杂.有没有简单的方法可以做到这一点?

I've been looking at similarity functions, soundex, and such, but it all seems so complicated. Is there an easy way to do this?

谢谢!

这是一些示例数据:

FirstName    |   LastName    |      Add1
John         |   Doe         |   1 Main St
John         |   Doe         |   
John A.      |   Doe         |   
Jane         |   Doe         |   2 Union Ave
Jane B.      |   Doe         |   2 Union Ave
Alex         |   Smith       |   3 Broad St
Chris        |   Anderson    |   4 South Blvd
Chris        |   Anderson    |   4 South Blvd

我真的很喜欢Critical Error的查询来识别所有不同类型的骗子.这将为我提供以上示例数据,其中不包括Alex Smith的结果,因为没有重复数据.

I really like Critical Error's query for identifying all different types of dupes. That would give me the above sample data, with the Alex Smith result not included, because there are no dupes for that.

我想要做的是获取该结果集,并确定简·杜(Jane Doe)的受骗对象.她应该只有2个骗子.约翰·多伊(John Doe)有3个,克里斯·安德森(Chris Anderson)有2个.我可以得到该子结果集吗?

What I want to do is take that result set and identify which are dupes for Jane Doe. She should only have 2 dupes. John Doe has 3, and Chris Anderson has 2. Can I get at that sub-result set?

我知道了!我将把严重错误"的答案标记为解决方案,因为它使我完全可以找到所需的位置.这是解决方案,以防可能对其他人有所帮助.基本上,这就是我们正在做的.

I figured it out! I will be marking Critical Error's answer as the solution, since it totally got me where I needed to go. Here is the solution, in case it might help others. Basically, this is what we are doing.

  1. 从有重复的表中选择记录
  2. 添加WHERE EXISTS子查询以在同一表中查找完全相同的重复项,其中主查询和子查询的ID不匹配
  3. 使用重复列之间的差异因子(其中主查询和子查询的ID不匹配),添加WHERE EXISTS子查询以在同一表中查找相似的重复对象
  4. 添加WHERE EXISTS子查询以在同一表中查找2个字段上的重复项,其中其中一个记录的第三个字段可能为null,而主查询和子查询的ID不匹配
  5. 每个子查询都与一个OR关联,因此可以找到任何重复项
  6. 在每个子查询的末尾添加一个嵌套的要求,即主查询或子查询都是要为其标识重复项的记录的ID.

    DECLARE @CID AS INT
    
    SET ANSI_NULLS ON
    
        SET NOCOUNT ON;
        SET @CID = 12345
    BEGIN
    
    
    SELECT
        *
    FROM @Customers c
    WHERE
        -- Exact duplicates.
        EXISTS (
    
            SELECT * FROM @Customers x WHERE
                x.FirstName = c.FirstName 
                AND x.LastName = c.LastName 
                AND x.Add1 = c.Add1 
                AND x.Id <> c.Id
                AND (x.ID = @CID OR c.ID = @CID)
        
        )
        -- Match First/Last name are same/similar and the address is same.
        OR EXISTS (
        
            SELECT * FROM @Customers x WHERE
                DIFFERENCE( x.FirstName, c.FirstName ) = 4 
                AND DIFFERENCE( x.LastName, c.LastName ) = 4
                AND x.Add1 = c.Add1
                AND x.Id <> c.Id
                AND (x.ID = @CID OR c.ID = @CID)
    
        )
        -- Match First/Last name and one address exists.
        OR EXISTS (
        
            SELECT * FROM @Customers x WHERE
                x.FirstName = c.FirstName 
                AND x.LastName = c.LastName
                AND x.Id <> c.Id
                AND (
                    x.Add1 IS NULL AND c.Add1 IS NOT NULL
                    OR
                    x.Add1 IS NOT NULL AND c.Add1 IS NULL
                )
                AND (x.ID = @CID OR c.ID = @CID)
        );

推荐答案

假设记录之间有唯一的ID,可以尝试一下:

Assuming you have a unique id between records, you can give this a try:

DECLARE @Customers table ( FirstName varchar(50), LastName varchar(50), Add1 varchar(50), Id int IDENTITY(1,1) );
INSERT INTO @Customers ( FirstName, LastName, Add1 ) VALUES
    ( 'John', 'Doe', '123 Anywhere Ln' ),
    ( 'John', 'Doe', '123 Anywhere Ln' ),
    ( 'John', 'Doe', NULL ),
    ( 'John C.', 'Doe', '123 Anywhere Ln' ),
    ( 'John C.', 'Doe', '15673 SW Liar Dr' );

SELECT
    *
FROM @Customers c
WHERE
    -- Exact duplicates.
    EXISTS (

        SELECT * FROM @Customers x WHERE
            x.FirstName = c.FirstName 
            AND x.LastName = c.LastName 
            AND x.Add1 = c.Add1 
            AND x.Id <> c.Id
    
    )
    -- Match First/Last name are same/similar and the address is same.
    OR EXISTS (
    
        SELECT * FROM @Customers x WHERE
            DIFFERENCE( x.FirstName, c.FirstName ) = 4 
            AND DIFFERENCE( x.LastName, c.LastName ) = 4
            AND x.Add1 = c.Add1
            AND x.Id <> c.Id

    )
    -- Match First/Last name and one address exists.
    OR EXISTS (
    
        SELECT * FROM @Customers x WHERE
            x.FirstName = c.FirstName 
            AND x.LastName = c.LastName
            AND x.Id <> c.Id
            AND (
                x.Add1 IS NULL AND c.Add1 IS NOT NULL
                OR
                x.Add1 IS NOT NULL AND c.Add1 IS NULL
            )
    );

返回

+-----------+----------+-----------------+----+
| FirstName | LastName |      Add1       | Id |
+-----------+----------+-----------------+----+
| John      | Doe      | 123 Anywhere Ln |  1 |
| John      | Doe      | 123 Anywhere Ln |  2 |
| John      | Doe      | NULL            |  3 |
| John C.   | Doe      | 123 Anywhere Ln |  4 |
+-----------+----------+-----------------+----+

初始结果集:

+-----------+----------+------------------+----+
| FirstName | LastName |       Add1       | Id |
+-----------+----------+------------------+----+
| John      | Doe      | 123 Anywhere Ln  |  1 |
| John      | Doe      | 123 Anywhere Ln  |  2 |
| John      | Doe      | NULL             |  3 |
| John C.   | Doe      | 123 Anywhere Ln  |  4 |
| John C.   | Doe      | 15673 SW Liar Dr |  5 |
+-----------+----------+------------------+----+

这篇关于SQL查询查找完全相同和近似重复的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆