SQL组合多个标识符为重复的记录创建组ID [英] SQL combine multiple identifiers to create a group id for duplicate records
问题描述
我正在努力解决Oracle中一个难以解决的问题.
I'm working on a problem in Oracle that I'm struggling to solve 'elegantly'.
我有一个带有三个不同标识符的数据提取:A,B,C
I have a data extract with three different identifiers: A, B, C
每个标识符可能出现在不止一行中,并且每一行可能具有这三个标识符中的一个或多个(即,该列已填充或为空).
Each identifier may appear in more than one row, and each row may have one or more of these three identifiers (i.e the column is populated or null).
我希望能够对所有具有A,B或C任意组合的记录进行分组,并为其分配相同的组ID.
I want to be able to group all records that have any combination of either A, B or C in common and assign them the same group id.
提取表显示最终的组应该是什么:
Extract table showing what the eventual groups should be:
Rownum | A | B | C | End group
1 p NULL NULL 1
2 p r NULL 1
3 q NULL NULL 2
4 NULL r NULL 1
5 NULL NULL s 2
6 q NULL s 2
我最初的方法是为提取物中的每一行分配一个GUID并为三个标识符创建一个查找表:
My original approach was to assign a guid to each row in the extract and create a lookup table for the three identifiers:
GUID | IDENTIFIER | IDENTIFIER TYPE | GROUP | END GROUP
1 p A 1 1
2 p A 1 1
2 r B 2 1
3 q A 3 3
4 r B 2 1
5 s C 4 3
6 q A 3 3
6 s C 4 3
然后按标识符分组并分配一个组号.但是,在可能的情况下,需要将这些组合并以提供端组中显示的视图.
Then group by identifier and assign a group number. The groups, however, need to be combined where possible to provide the view shown in end group.
我可以想到的唯一解决方案是使用循环,而我宁愿避免使用循环.
The only solution I can think of for this problem is to use loops, which I'd rather avoid.
任何想法都将不胜感激.
Any ideas would be greatly appreciated.
Niall
推荐答案
这确实是一个有趣的问题.不过,我认为我们仍缺少组"的定义.由于在您的示例中,(p,null,null)
(row1)和(null,r,null)
(row4)没有共用标识符,并且属于同一组,因此我将使用此定义进行分组:
This is truly an interesting problem. Still, I think we are missing a definition of a "group". Since in your example (p,null,null)
(row1) and (null,r,null)
(row4) share no common identifier and belong to the same group I'll go with this definition for grouping:
如果一行与该组的至少一行共享至少一个标识符,则该行属于该组.
A row belongs to a group if it shares at least one identifier with at least one row of this group.
这意味着我们可以链接"行.这自然会导致分层解决方案:
This means we can "chain" rows. This naturally leads to a hierarchical solution:
SQL> SELECT ID, a, b, c, MIN(grp) grp
2 FROM (SELECT connect_by_root(id) ID,
3 connect_by_root(a) a,
4 connect_by_root(b) b,
5 connect_by_root(c) c,
6 ID grp
7 FROM a
8 CONNECT BY NOCYCLE(PRIOR a = a
9 OR PRIOR b = b
10 OR PRIOR c = c))
11 GROUP BY ID, a, b, c
12 ORDER BY ID;
ID A B C GRP
---------- ---------- ---------- ---------- ----------
1 p 1
2 p r 1
3 q 3
4 r 1
5 s 3
6 q s 3
6 rows selected
您可以执行子查询来了解其构造:
You can execute the subquery to understand the construction:
SQL> SELECT connect_by_root(id) ID,
2 connect_by_root(a) a,
3 connect_by_root(b) b,
4 connect_by_root(c) c,
5 substr(sys_connect_by_path(ID, '->'), 3) path,
6 ID grp
7 FROM a
8 CONNECT BY NOCYCLE(a = PRIOR a
9 OR b = PRIOR b
10 OR c = PRIOR c);
ID A B C PATH GRP
---------- ---------- ---------- ---------- -------- ----------
1 p 1 1
1 p 1->2 2
1 p 1->2->4 4
2 p r 2 2
2 p r 2->1 1
2 p r 2->4 4
3 q 3 3
3 q 3->6 6
3 q 3->6->5 5
4 r 4 4
4 r 4->2 2
4 r 4->2->1 1
5 s 5 5
5 s 5->6 6
5 s 5->6->3 3
6 q s 6 6
6 q s 6->3 3
6 q s 6->5 5
18 rows selected
这篇关于SQL组合多个标识符为重复的记录创建组ID的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!