一个非常复杂的SQL查询问题 [英] A very complicated SQL query issue

查看:25
本文介绍了一个非常复杂的SQL查询问题的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两张桌子......

I have 2 tables ...

  • 客户
  • 客户识别

客户表有 2 个字段

  • CustomerId varchar(20)
  • Customer_Id_Link varchar(50)

CustomerIdentification 表有 3 个字段

CustomerIdentification table has 3 fields

  • CustomerId varchar(20)
  • Identification_Number varchar(50)
  • Personal_ID_Type_Code int -- 是另一个表的外键,但这无关紧要

基本上,Customer 是客户主表(以 CustomerID 作为主键)并且 CustomerIdentification 可以为给定的客户提供多个标识.换句话说,CustomerIdentification 中的 CustomerId 是 Customer 表的外键.一个客户可以拥有多份身份证明,每一份都有一个 Identification_NumberPersonal_ID_Type_Code(这是一个整数,用于告诉您身份证明是护照、罪证、驾照等.).

Basically, Customer is the customer master table (with CustomerID as primary key) and CustomerIdentification can have several pieces of identifications for a given customer. In other words, CustomerId in CustomerIdentification is a foriegn key to Customer table. A customer can have many pieces of identifications, each having a Identification_Number and Personal_ID_Type_Code (which is an integer that tells you whether the identification is a passport, sin, drivers license etc.).

现在,客户表有以下数据:Customer_Id_Link此时为空(空字符串)

Now, customer table has the following data: Customer_Id_Link is blank (empty string) at this point

CustomerId      Customer_Id_Link
--------------------------------
 'CU-1'         <Blank>
 'CU-2'         <Blank>
 'CU-3'         <Blank>
 'CU-4'         <Blank>
 'CU-5'         <Blank>

和 CustomerIdentification 表有以下数据:

and CustomerIdentification table has the following data:

CustomerId    Identification_Number    Personal_ID_Type_Code
------------------------------------------------------------
'CU-1'        'A'                      1
'CU-1'        'A'                      2
'CU-1'        'A'                      3
'CU-2'        'A'                      1
'CU-2'        'B'                      3
'CU-2'        'C'                      4
'CU-3'        'A'                      1
'CU-3'        'B'                      2
'CU-3'        'C'                      4
'CU-4'        'A'                      1
'CU-4'        'B'                      2
'CU-4'        'B'                      3
'CU-5'        'B'                      3

基本上,多个客户可以在 CustomerIdentification 中拥有相同的 Identification_NumberPersonal_ID_Type_Code.发生这种情况时,所有 Customer_Id_Link 字段都需要更新为一个通用值(可以是 GUID 或其他).但对此的处理更为复杂.

Essentially, more than one customer can have same Identification_Number and Personal_ID_Type_Code in CustomerIdentification. When this happens, all Customer_Id_Link fields need to be updated with a common value (could be a GUID or whatever). But the processing for this is more complex.

规则如下:

用于匹配客户记录之间的 Personal_ID_Type_CodeIdentification_Number 字段- 比较上述匹配中所有客户记录的所有其他常见 Personal_ID_Type_Code 字段的 Identification_Number 字段- 如果为真,则链接客户记录

For matching Personal_ID_Type_Code and Identification_Number fields between Customer Records - Compare the Identification_Number fields for all other common Personal_ID_Type_Code fields for all the Customer Records from the above match - if true, then link the Customer Records

例如:

CU-1、CU-2、CU-3、CU-4 的匹配 ID 1 A

Match ID 1 A for CU-1, CU-2, CU-3, CU-4

  • 异常 ID 2 不匹配(CU-1 上的 A 与 CU-3 上的 B)
  • 未建立关联

CU-3、CU-4 的匹配 ID 2 B

Match ID 2 B for CU-3, CU-4

  • 没有 ID 不匹配
  • 链接 CU-3 和 CU-4(更新 Customer_Id_Link 字段,在客户表中为两者使用一个共同的值)
  • No ID mismatch
  • Link CU-3 and CU-4 (update Customer_Id_Link field with a common value in customer table for both)

CU-1、CU-4 的匹配 ID 3 A

Match ID 3 A for CU-1, CU-4

  • 异常 ID 2 不匹配(A 与 B)
  • 未建立关联

CU-2、CU-5 的匹配 ID 3 B

Match ID 3 B for CU-2, CU-5

  • 没有 ID 不匹配
  • 链接 CU-2 和 CU-5(更新 Customer_Id_Link 字段,在两者的客户表中使用一个共同的值)为 CU-2、CU-3 匹配 ID 4 C
  • CU-2 已链接,将 CU-5 保留到客户链接列表中
  • CU-3 已链接,将 CU-4 保留到客户链接列表中
  • 异常 ID 3 不匹配(CU-2 上的 B 与 CU-4 上的 A)
  • 未完成链接(保留之前的链接)
  • No ID mismatch
  • Link CU-2 and CU-5 (update Customer_Id_Link field with a common value in customer table for both) Match ID 4 C for CU-2, CU-3
  • CU-2 already linked, keep CU-5 to customer linking list
  • CU-3 already linked, keep CU-4 to customer linking list
  • Exception ID 3 mismatch (B on CU-2 vs A on CU-4)
  • No linkage done (previous linkage remains)

任何帮助将不胜感激.这让我醒了两天,我似乎无法找到解决方案.理想情况下,该解决方案将是一个存储过程,我可以执行它来进行客户链接.

Any help will be appreciated. This has kept me awake for two days now, and I cant seem to be able to find the solution. Ideally, the solution will be a stored procedure that I can execute to do customer linking.

- SQL Server 2008 R2 标准 64 位

 - SQL Server 2008 R2 Standard 64 bit

更新------------------------------

我知道解释这个问题会很困难,所以我承担了责任.但本质上,我希望能够链接所有具有相同标识号的客户,只有一个客户可以拥有 1 个以上的标识号.以示例 1. 1 A(1 是 Personal_id_type_code 和 A 是标识号存在于 4 个不同的客户.CU-1、CU-2、CU-3、CU-4.因此它们可能是在 4 个不同时间存在的同一客户具有不同客户 ID 的客户表.我们需要将它们与 1 个共同值联系起来.但是,CU-1 有 2 个其他标识,如果其中 1 个与其他 3 个不同(CU-2、CU-3、CU-4) 他们不是同一个客户.因此,编号为 A 的 ID 2 与 CU-3(其 B)的 ID 2 不匹配,并且与 CU-4 相同.此外,即使 CU-2 中不存在 ID 2 num A, CU-1 的 ID 3 和 num A 与 CU-2 的 ID 3(其 B)不匹配.因此根本不匹配.

I knew it was going to be tough to explain this problem, so I take the blame. But essentially, I want to be able to link all the customers that have same identificationNumbers, only, a customer can have more than 1 identificationNumber. Take example 1. 1 A (1 being Personal_id_type_code and A being identificationNumber exists for 4 different customers. CU-1, CU-2, CU-3, CU-4. So they could potentially be the same customer that exists 4 different times in customer table with different customer ID. We need to link them with 1 common value. However, CU-1 has 2 other identifications and if even 1 of them is different from the other 3 (CU-2, CU-3, CU-4) they are not the same customer. So ID 2 with Num A does not match with ID 2 for CU-3 (its B) and same for CU-4. Also, even though ID 2 num A does not exist in CU-2, CU-1's ID 3 and num A does not match with CU-2s ID 3 (its B). Therefore its not a match at all.

下一个常见的 Id 和 num 是 2-b,它存在于 CU-3 和 CU-4 中.这两个客户实际上是相同的,因为两者都有 ID 1 - A 和 ID 2 - B.ID 4 - C 和 ID 3 - A 无关紧要,因为两个 ID 不同.这实质上意味着这个客户有 4 个 ID I A、2 B、4 C 和 3 A.所以现在我们需要将这个客户与客户表中的一个共同的唯一值 (guid) 联系起来.

Next common Id's and num is 2-b which exists in CU-3 and CU-4. These two customers are in fact same cause both have ID 1 - A and ID 2 - B. ID 4 - C and ID 3 - A is irrelevant cause both IDs are different. Which essentially means this customer has 4 IDs I A, 2 B, 4 C and 3 A. So now we need to link this customer with a common unique value (guid) in customer table.

我希望我现在解释了这个非常复杂的问题.很难解释,因为这是一个非常独特的问题.

I hope I explained this very complicated issue now. It is tough to explain as this is a very unique problem.

推荐答案

我对您的数据模型进行了一些更改,以尝试使其更加明显地显示正在发生的事情..

I've changed your data model a bit to try and make it a bit more obvious what's going on..

CREATE TABLE [dbo].[Customer]
(
    [CustomerName]      VARCHAR(20)     NOT NULL,
    [CustomerLink]      VARBINARY(20)   NULL
)

CREATE TABLE [dbo].[CustomerIdentification]
(
    [CustomerName]      VARCHAR(20)     NOT NULL,
    [ID]                VARCHAR(50)     NOT NULL,
    [IDType]            VARCHAR(16)     NOT NULL
)

而且我添加了更多测试数据..

And I've added some more test data..

INSERT  [dbo].[Customer]
        ([CustomerName])
VALUES  ('Fred'),
        ('Bob'),
        ('Vince'),
        ('Tom'),
        ('Alice'),
        ('Matt'),
        ('Dan')

INSERT  [dbo].[CustomerIdentification]
VALUES  
        ('Fred',    'A',    'Passport'),
        ('Fred',    'A',    'SIN'),
        ('Fred',    'A',    'Drivers Licence'),
        ('Bob',     'A',    'Passport'),
        ('Bob',     'B',    'Drivers Licence'),
        ('Bob',     'C',    'Credit Card'),
        ('Vince',   'A',    'Passport'),
        ('Vince',   'B',    'SIN'),
        ('Vince',   'C',    'Credit Card'),
        ('Tom',     'A',    'Passport'),
        ('Tom',     'B',    'SIN'),
        ('Tom',     'B',    'Drivers Licence'),
        ('Alice',   'B',    'Drivers Licence'),
        ('Matt',    'X',    'Drivers Licence'),
        ('Dan',     'X',    'Drivers Licence')

这是您要找的吗:

;WITH [cteNonMatchingIDs] AS (
    -- Pairs where the IDType is the same, but 
    -- name and ID don't match
    SELECT  ci3.[CustomerName] AS [CustomerName1],
            ci4.[CustomerName] AS [CustomerName2]
    FROM [dbo].[CustomerIdentification] ci3
    INNER JOIN [dbo].[CustomerIdentification] ci4
        ON ci3.[IDType] = ci4.[IDType]
    WHERE ci3.[CustomerName] <> ci4.[CustomerName]
    AND ci3.[ID] <> ci4.[ID]
),
[cteMatchedPairs] AS (
    -- Pairs where the IDType and ID match, and
    -- there aren't any non matching IDs for the
    -- CustomerName
    SELECT DISTINCT 
            ci1.[CustomerName] AS [CustomerName1],
            ci2.[CustomerName] AS [CustomerName2]
    FROM [dbo].[CustomerIdentification] ci1
    LEFT JOIN [dbo].[CustomerIdentification] ci2
        ON ci1.[CustomerName] <> ci2.[CustomerName]
        AND ci1.[IDType] = ci2.[IDType] 
    WHERE ci1.[ID] = ISNULL(ci2.[ID], ci1.[ID])
    AND NOT EXISTS (
        SELECT 1
        FROM [cteNonMatchingIDs]
        WHERE ci1.[CustomerName] = [CustomerName1] -- correlated subquery
        AND ci2.[CustomerName] = [CustomerName2]
    )
    AND ci1.[CustomerName] < ci2.[CustomerName]
),
[cteMatchedList] ([CustomerName], [CustomerNameList]) AS (
    -- Turn the matched pairs into list of matching
    -- CustomerNames
    SELECT  [CustomerName1],
            [CustomerNameList]
    FROM (
        SELECT  [CustomerName1],
                CONVERT(VARCHAR(1000), '$'
                 + [CustomerName1] + '$'
                 + [CustomerName2]) AS [CustomerNameList]
        FROM [cteMatchedPairs]
        UNION ALL
        SELECT  [CustomerName2],
                CONVERT(VARCHAR(1000), '$'
                 + [CustomerName2]) AS [CustomerNameList]
        FROM [cteMatchedPairs]
    ) [cteMatchedPairs]
    UNION ALL
    SELECT  [cteMatchedList].[CustomerName],
            CONVERT(VARCHAR(1000),[CustomerNameList] + '$'
             + [cteMatchedPairs].[CustomerName2])
    FROM [cteMatchedList] -- recursive CTE
    INNER JOIN [cteMatchedPairs]
        ON RIGHT([cteMatchedList].[CustomerNameList],
         LEN([cteMatchedPairs].[CustomerName1])
        ) = [cteMatchedPairs].[CustomerName1]
),
[cteSubstringLists] AS (
    SELECT  r1.[CustomerName],
            r2.[CustomerNameList]
    FROM [cteMatchedList] r1
    INNER JOIN [cteMatchedList] r2
        ON r2.[CustomerNameList] LIKE '%' + r1.[CustomerNameList] + '%'
),
[cteCustomerLink] AS (
    SELECT DISTINCT 
            x1.[CustomerName],
            HASHBYTES('SHA1', x2.[CustomerNameList]) AS [CustomerLink]
    FROM (
        SELECT  [CustomerName],
                MAX(LEN([CustomerNameList])) AS [MAX LEN CustomerList]
        FROM [cteSubstringLists]
        GROUP BY [CustomerName]
    ) x1
    INNER JOIN (
        SELECT  [CustomerName],
                LEN([CustomerNameList]) AS [LEN CustomerList], 
                [CustomerNameList]
        FROM [cteSubstringLists]
    ) x2
        ON x1.[MAX LEN CustomerList] = x2.[LEN CustomerList]
        AND x1.[CustomerName] = x2.[CustomerName]
)
UPDATE  c
SET     [CustomerLink] = cl.[CustomerLink]
FROM [dbo].[Customer] c
INNER JOIN [cteCustomerLink] cl
    ON cl.[CustomerName] = c.[CustomerName]


SELECT *
FROM [dbo].[Customer]

这篇关于一个非常复杂的SQL查询问题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆