Duplicates removal using Group By, Rank, Row_Number

Views: 92 Published: 2020/8/1 20:01:41 Tags: sql-server group-by duplicates ranking row-number


Problem description

I have two tables. One is CustomerOrders and the other is OrderCustomerRef, a lookup table.

The two tables have a one-to-many relationship: one customer may be associated with multiple orders.

The CustomerOrders table has duplicate customers (same LName, FName, Email), but they have different Cust_IDs.

I need to merge all duplicate contacts in the base Customer table (one-to-one). (This table is not shown here.)

Step 1: Find out which Cust_ID should be merged into which corresponding duplicate customer(s) (same LName, FName, Email). The contact with the latest Order_Date should win over its duplicate counterparts. VIP customers are an exception: they should always win, regardless of Order_Date.

Step 2: Update the OrderCustomerRef table: replace all losing duplicate Cust_IDs with the winning Cust_IDs.

Step 3: Delete all losing contacts from the base Customer table (not in the current scope; I will do it myself).

IF OBJECT_ID('tempdb..#table') IS NOT NULL
DROP TABLE #table;

IF OBJECT_ID('tempdb..#CustomerOrders') IS NOT NULL
DROP TABLE #CustomerOrders;

IF OBJECT_ID('tempdb..#OrderCustomerRef') IS NOT NULL
DROP TABLE #OrderCustomerRef;

CREATE TABLE #CustomerOrders 
(
[PK_ID] INT NOT NULL PRIMARY KEY IDENTITY(1,1),
Cust_ID INT NOT NULL, 
LName VARCHAR(100) NULL, 
FName VARCHAR(100) NULL, 
[Customer_E-mail] VARCHAR(100) NULL,
Order_Date DATETIME NULL,
Customer_Source VARCHAR(100) NULL,
CustomerType VARCHAR(100) NULL
)

INSERT INTO #CustomerOrders (Cust_ID, LName, FName, [Customer_E-mail], Order_Date, Customer_Source, CustomerType)
VALUES 
(1, 'John', 'Smith', 'JSmith@email.com', '2018-11-10 01:40:55.150', 'XYZ Company', 'Regular'),
(2, 'John', 'Smith', 'JSmith@email.com', '2018-10-10 05:05:55.150', 'Internet', 'VIP'),
(3, 'Adam', 'Burns', 'ABurns@email.com', '2017-05-05 00:00:00.000', 'XYZ Company','Regular'),
(3, 'Adam', 'Burns', 'ABurns@email.com', '2017-05-05 00:00:00.000', 'XYZ Company','VIP'),
(4, 'Adam', 'Burns', 'ABurns@email.com', '2017-05-05 00:00:00.000', 'Internet','Regular'),
(5, 'Adam', 'Burns', 'ABurns@email.com', '2017-05-05 00:00:00.000', 'Internet','VIP'),
(6, 'James', 'Snatcher', 'JSnatcher@email.com', '2019-07-07 00:00:00.000', 'XYZ Company', 'Regular'),
(7, 'James', 'Snatcher', 'JSnatcher@email.com', '2019-07-07 00:00:00.000', 'Internet','Regular'),
(9, 'Thomas', 'Johnson', 'TJohnson@email.com', '2016-05-01 00:00:00.000', 'Internet','Regular'),
(9, 'Thomas', 'Johnson', 'TJohnson@email.com', '2015-04-01 00:00:00.000', 'Internet','Regular'),
(10, 'Thomas', 'Johnson', 'TJohnson@email.com', '2014-03-01 00:00:00.000', 'Internet','Regular'),
(11, 'Thomas', 'Johnson', 'TJohnson@email.com', '2013-02-01 00:00:00.000', 'XYZ Company','Regular'),
(12, 'Peter', 'McDonald', 'PMcDonald@email.com', '2013-02-01 00:00:00.000', 'XYZ Company','Regular'),
(13, 'Jose', 'Mainster', 'JMainster@email.com', '2013-02-01 00:00:00.000', 'Internet','Regular'),
(14, 'Kevin', 'Digginton', 'KDigginton@email.com', '2013-02-01 00:00:00.000', 'Internet','Regular'),
(14, 'Kevin', 'Digginton', 'KDigginton@email.com', '2015-09-03 00:00:00.000', 'Internet','Regular')

CREATE TABLE #OrderCustomerRef
(
    Raw_PK INT NOT NULL PRIMARY KEY IDENTITY(1,1),
    OrderID INT NOT NULL, 
    Cust_ID INT NULL, 
    OrderType VARCHAR(100) NULL
)

    INSERT INTO #OrderCustomerRef (OrderID, Cust_ID, OrderType)
    VALUES 
    (1,1,'Online'),
    (2,2,'Online'),
    (3,3,'Online'),
    (4,3,'Online'),
    (5,4,'In Store'),
    (6,5,'Online'),
    (7,6,'Online'),
    (8,7,'In Store'),
    (9,9,'Online'),
    (10,9,'Online'),
    (11,10,'In Store'),
    (12,11,'Online'),
    (13,12,'Online'),
    (14,13,'Online'),
    (15,14,'Online'),
    (16,14,'In Store')

    -- SELECT * FROM #OrderCustomerRef

    SELECT *,
    RANK() OVER (PARTITION BY FName, LName, [Customer_E-mail], Customer_Source ORDER BY Order_Date DESC) AS Rank_1,
    RANK() OVER (PARTITION BY FName, LName, [Customer_E-mail], Customer_Source ORDER BY Order_Date, CustomerType DESC ) AS Rank_CustType,
    RANK() OVER (PARTITION BY Cust_ID, FName, LName, [Customer_E-mail], Customer_Source ORDER BY Order_Date, CustomerType DESC ) AS Rank_CustID,
    RANK() OVER (PARTITION BY FName, LName, [Customer_E-mail] ORDER BY Order_Date DESC) AS Rank_2,
    RANK() OVER (PARTITION BY FName, LName, [Customer_E-mail] ORDER BY Cust_ID) AS Rank_3
    FROM #CustomerOrders

The desired output looks like:

Exception:
- losing Customer IDs 1, 3 (they would otherwise win, but each has a duplicate counterpart that is a VIP, so they lose)
- winning Customer IDs 2, 5 (because they are VIPs, subject to the exception)

E.g.: all occurrences of Cust_ID 1 (John Smith) in #OrderCustomerRef should be replaced with Cust_ID 2 (John Smith), and all occurrences of Cust_ID 3 (Adam Burns) should be replaced with Cust_ID 5 (Adam Burns).

General rule:
- losing Customer IDs 7, 10, 11, 4
- winning Customer IDs 6, 9, 12, 13, 14

E.g.: all occurrences of Cust_ID 7 in #OrderCustomerRef should be replaced with 6, and all occurrences of Cust_ID 10 should be replaced with 9.

In the end, the #OrderCustomerRef table should contain only Customer IDs 6, 9, 12, 13, 14, 2, 5.
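The winner rule from Step 1 could be sketched roughly as follows. This is only a minimal sketch under two assumptions that the post does not state: a Cust_ID counts as VIP only when every one of its rows is VIP (Cust_ID 3 has both a Regular and a VIP row, yet is listed as losing to 5), and any remaining tie is broken by the lowest Cust_ID (6 is listed as winning over 7 although both share the same Order_Date). The names PerCustomer, Is_VIP, Latest_Order and Winner_Cust_ID are illustrative and do not appear in the original.

```sql
-- Run after the setup script above has created and filled #CustomerOrders.
;WITH PerCustomer AS (
    -- Collapse the order rows to one row per Cust_ID inside each duplicate group.
    SELECT
        Cust_ID, LName, FName, [Customer_E-mail],
        -- Assumption: a Cust_ID is VIP only if all of its rows are VIP.
        MIN(CASE WHEN CustomerType = 'VIP' THEN 1 ELSE 0 END) AS Is_VIP,
        MAX(Order_Date) AS Latest_Order
    FROM #CustomerOrders
    GROUP BY Cust_ID, LName, FName, [Customer_E-mail]
)
SELECT
    Cust_ID, LName, FName, [Customer_E-mail], Is_VIP, Latest_Order,
    -- First Cust_ID in the group when sorted VIP first, then latest order,
    -- then lowest Cust_ID (the tie-break is an assumption).
    FIRST_VALUE(Cust_ID) OVER (
        PARTITION BY LName, FName, [Customer_E-mail]
        ORDER BY Is_VIP DESC, Latest_Order DESC, Cust_ID ASC
    ) AS Winner_Cust_ID
FROM PerCustomer
ORDER BY LName, FName, Cust_ID;
```

Rows where Cust_ID differs from Winner_Cust_ID are the losing duplicates. FIRST_VALUE requires SQL Server 2012 or later; on older versions a ROW_NUMBER() with the same ORDER BY, filtered to 1, would play the same role.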

Using Rank_CustType_1, column_1, and column_2 I can figure out Step 1. But I still have a problem with Step 2, updating the OrderCustomerRef table so that all losing Cust_IDs are replaced with the corresponding winning duplicate Cust_IDs.

I've tried this, but it still does not replace the losing Cust_IDs.

SELECT *,
    RANK() OVER (PARTITION BY FName, LName, [Customer_E-mail] ORDER BY Order_Date, CustomerType DESC) AS Rank_CustType_1,
    RANK() OVER (PARTITION BY FName, LName, [Customer_E-mail] ORDER BY Cust_ID) AS Rank_3
INTO #table
FROM #CustomerOrders

; with cte as (
    select Cust_ID, FName, LName, [Customer_E-mail], max(t.Rank_CustType_1) as Rank_CustType_1
        ,(select distinct Cust_ID from #table a where a.Cust_ID = t.Cust_ID and Rank_3 = 1) column_1
        ,(select distinct Cust_ID from #table a where a.Cust_ID = t.Cust_ID and Rank_3 <> 1) column_2
    from #table t
    group by Cust_ID, FName, LName, [Customer_E-mail]
)
update b
set Cust_ID = case
        when b.Cust_ID = cte.Cust_ID and
             b.Cust_ID = ISNULL(cte.column_1,'') and Rank_CustType_1 != 1 then b.Cust_ID
        when b.Cust_ID = cte.Cust_ID and
             b.Cust_ID = ISNULL(cte.column_2,'') and Rank_CustType_1 != 1 then cte.column_2
        when b.Cust_ID = cte.Cust_ID and Rank_CustType_1 = 1 and cte.column_1 is null and cte.column_2 is not null then cte.column_2
        when b.Cust_ID = cte.Cust_ID and Rank_CustType_1 = 1 and cte.column_1 is not null and cte.column_2 is null then cte.column_1
    end
from #OrderCustomerRef b
inner join cte on b.Cust_ID = cte.Cust_ID;

select * from #OrderCustomerRef;
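
For comparison, one possible shape for the Step 2 update is sketched below. It is not a definitive solution: it assumes that a Cust_ID is VIP only when all of its rows are VIP and that ties are broken by the lowest Cust_ID, neither of which the post states explicitly, and the names PerCustomer, Ranked, Is_VIP, Latest_Order and rn are illustrative. It should be run against a freshly populated #OrderCustomerRef, since an earlier update attempt may already have changed the Cust_ID values.

```sql
-- Re-run the setup script first so #CustomerOrders and #OrderCustomerRef are fresh.
;WITH PerCustomer AS (
    -- One row per Cust_ID inside each duplicate group.
    SELECT
        Cust_ID, LName, FName, [Customer_E-mail],
        -- Assumption: VIP only if every row of this Cust_ID is VIP.
        MIN(CASE WHEN CustomerType = 'VIP' THEN 1 ELSE 0 END) AS Is_VIP,
        MAX(Order_Date) AS Latest_Order
    FROM #CustomerOrders
    GROUP BY Cust_ID, LName, FName, [Customer_E-mail]
),
Ranked AS (
    -- rn = 1 marks the winner of each duplicate group.
    SELECT
        Cust_ID, LName, FName, [Customer_E-mail],
        ROW_NUMBER() OVER (
            PARTITION BY LName, FName, [Customer_E-mail]
            ORDER BY Is_VIP DESC, Latest_Order DESC, Cust_ID ASC  -- tie-break is an assumption
        ) AS rn
    FROM PerCustomer
)
UPDATE r
SET Cust_ID = w.Cust_ID
FROM #OrderCustomerRef AS r
JOIN Ranked AS l                      -- l: the losing Cust_ID that r currently points at
    ON l.Cust_ID = r.Cust_ID
   AND l.rn > 1
JOIN Ranked AS w                      -- w: the winner of the same duplicate group
    ON w.LName = l.LName
   AND w.FName = l.FName
   AND w.[Customer_E-mail] = l.[Customer_E-mail]
   AND w.rn = 1;

-- Check which Cust_IDs remain after the update.
SELECT DISTINCT Cust_ID FROM #OrderCustomerRef ORDER BY Cust_ID;
```

Only rows that reference a losing Cust_ID are touched, so winners are left as they are and no row is set to NULL. The equality joins on LName, FName and [Customer_E-mail] would not match rows where any of these is NULL; the sample data has no such rows, so that case is left open here.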
