通过交集合并2个数据集 [英] Joining 2 data sets via intersection
问题描述
我以前发布了此问题的一个版本,但是在使用这种略有不同的数据格式时,我一直在努力寻找答案,因此我在此方面再次伸出援手.
I posted a version of this question previously but am struggling to get that answer working with this slightly different format of data working...hence i am reaching out again on this.
我有以下数据集(请注意,读取下面数据的方式是ID1,福特,具有以下属性和值A:B,B:C和C:D
I have the following set of data (note the way to read the data below is ID1, Ford, has the following attributes and values A:B, B:C and C:D
+------------------------------------------------+
| ID NAME Attribute Attribute Value |
+------------------------------------------------+
| 1 Ford A B |
| 1 Ford B C |
| 1 Ford C D |
| 2 BMW A B |
| 2 BMW C D |
| 2 BMW F G |
| 3 TESLA Z Y |
| 3 TESLA E F |
| 3 TESLA A B |
+------------------------------------------------+
我基本上想将表中的每个ID与其他ID进行比较,然后输出结果.第一个比较是将ID 1与2和3进行比较 并进行比较,查看匹配项在哪里,哪些不匹配.
I would basically like to compare each ID in the table against the rest and output a result. The first comparison would be to check ID 1 against 2 and 3 and do a comparison and see where the matches are, and where they are not.
输出(仅完成第一次比较vs仅记录1条):
ouput (1st comparison done only vs only 1 record):
+----------------------------------------------------------------------------+
| BaseID BaseNAME Target ID TargetName MatchedOn Baseonly Tgtonly |
+----------------------------------------------------------------------------+
| 1 Ford 2 BMW A:B;C:D B:C F:G |
+----------------------------------------------------------------------------+
以前,一个好心的人帮助我实现了笛卡尔积,但是数据的格式略有不同-但这太慢了.所以我想看看是否有人对达到预期结果的最佳方法有任何想法?
Previously a kind individual helped me implement a Cartesian product but the data was in a slightly different format - but it was a bit too slow. So i wanted to see if anyone had any ideas on the best way to go about getting to the desired result?
推荐答案
这可能会更快:
with
t1 as (select distinct a.id ia, a.name na, b.id ib, b.name nb
from t a join t b on a.id < b.id),
t2 as (
select ia, na, ib, nb,
cast(multiset(select attr||':'||val from t where id = ia intersect
select attr||':'||val from t where id = ib )
as sys.odcivarchar2list) a1,
cast(multiset(select attr||':'||val from t where id = ia minus
select attr||':'||val from t where id = ib )
as sys.odcivarchar2list) a2,
cast(multiset(select attr||':'||val from t where id = ib minus
select attr||':'||val from t where id = ia )
as sys.odcivarchar2list) a3
from t1)
select ia, na, ib, nb,
(select listagg(column_value, ';') within group (order by null) from table(t2.a1)) l1,
(select listagg(column_value, ';') within group (order by null) from table(t2.a2)) l2,
(select listagg(column_value, ';') within group (order by null) from table(t2.a3)) l3
from t2
order by ia, ib
- 子查询
t1
创建将要比较的成对的汽车" -
t2
为每对收集相同或不同属性的集合.sys.odcivarchar2list
是内置类型,仅是字符串表 -
最终查询将集合更改为字符串列表.结果:
- subquery
t1
creates pairs of "cars" we will compare t2
gathers for each pair collections of common or different attributes.sys.odcivarchar2list
is built-in type, just table of stringfinal query changes collections into list of strings. Result:
IA NA IB NB L1 L2 L3
-- ------------ --- ----- --------- ------------ -----------
1 Ford 2 BMW A:B;C:D B:C F:G
1 Ford 3 TESLA A:B B:C;C:D E:F;Z:Y
2 BMW 3 TESLA A:B C:D;F:G E:F;Z:Y
我希望它可以更快,因为我们没有使用任何用户定义的函数,并且将操作次数减至最少.
I hope this to be faster, because we're not using any user defined function and number of operations is minimized.
替代方法是使用类似此功能的内容:
The alternative is to use something like this function:
-- find different or common attributes
create or replace function dca(i1 in number, i2 in number, op in char)
return varchar2 is
ret varchar2(1000);
begin
case op
when 'M' then -- minus
select listagg(attr||':'||val, ';') within group (order by null) into ret
from (select attr, val from t where id = i1 minus
select attr, val from t where id = i2 );
when 'I' then -- intersect
select listagg(attr||':'||val, ';') within group (order by null) into ret
from (select attr, val from t where id = i1 intersect
select attr, val from t where id = i2 );
end case;
return ret;
end;
在此查询中:
select ia, na, ib, nb,
dca(ia, ib, 'I') ab, dca(ia, ib, 'M') a_b, dca(ib, ia, 'M') b_a
from (select distinct a.id ia, a.name na, b.id ib, b.name nb
from t a join t b on a.id < b.id)
order by ia, ib;
它也可以工作,但是这是UDF,在查询中表现较差.
It works too, but this is UDF which performs worse in queries.
这篇关于通过交集合并2个数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!