通过交集合并2个数据集 [英] Joining 2 data sets via intersection

查看:77
本文介绍了通过交集合并2个数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我以前发布了此问题的一个版本,但是在使用这种略有不同的数据格式时,我一直在努力寻找答案,因此我在此方面再次伸出援手.

I posted a version of this question previously but am struggling to get that answer working with this slightly different format of data working...hence i am reaching out again on this.

我有以下数据集(请注意,读取下面数据的方式是ID1,福特,具有以下属性和值A:B,B:C和C:D

I have the following set of data (note the way to read the data below is ID1, Ford, has the following attributes and values A:B, B:C and C:D

+------------------------------------------------+
| ID     NAME     Attribute      Attribute Value |
+------------------------------------------------+
| 1      Ford         A                  B       |
| 1      Ford         B                  C       |
| 1      Ford         C                  D       |
| 2      BMW          A                  B       |
| 2      BMW          C                  D       |
| 2      BMW          F                  G       |
| 3      TESLA        Z                  Y       |
| 3      TESLA        E                  F       |
| 3      TESLA        A                  B       |
+------------------------------------------------+

我基本上想将表中的每个ID与其他ID进行比较,然后输出结果.第一个比较是将ID 1与2和3进行比较 并进行比较,查看匹配项在哪里,哪些不匹配.

I would basically like to compare each ID in the table against the rest and output a result. The first comparison would be to check ID 1 against 2 and 3 and do a comparison and see where the matches are, and where they are not.

输出(仅完成第一次比较vs仅记录1条):

ouput (1st comparison done only vs only 1 record):

+----------------------------------------------------------------------------+
| BaseID  BaseNAME   Target ID   TargetName    MatchedOn    Baseonly Tgtonly |
+----------------------------------------------------------------------------+
| 1        Ford         2          BMW           A:B;C:D     B:C     F:G     |
+----------------------------------------------------------------------------+

以前,一个好心的人帮助我实现了笛卡尔积,但是数据的格式略有不同-但这太慢了.所以我想看看是否有人对达到预期结果的最佳方法有任何想法?

Previously a kind individual helped me implement a Cartesian product but the data was in a slightly different format - but it was a bit too slow. So i wanted to see if anyone had any ideas on the best way to go about getting to the desired result?

推荐答案

这可能会更快:

with 
  t1 as (select distinct a.id ia, a.name na, b.id ib, b.name nb 
           from t a join t b on a.id < b.id),
  t2 as (
    select ia, na, ib, nb, 
           cast(multiset(select attr||':'||val from t where id = ia intersect 
                         select attr||':'||val from t where id = ib ) 
                as sys.odcivarchar2list) a1, 
           cast(multiset(select attr||':'||val from t where id = ia minus 
                         select attr||':'||val from t where id = ib ) 
                as sys.odcivarchar2list) a2, 
           cast(multiset(select attr||':'||val from t where id = ib minus 
                         select attr||':'||val from t where id = ia ) 
                as sys.odcivarchar2list) a3 
      from t1)
select ia, na, ib, nb, 
       (select listagg(column_value, ';') within group (order by null) from table(t2.a1)) l1,
       (select listagg(column_value, ';') within group (order by null) from table(t2.a2)) l2,
       (select listagg(column_value, ';') within group (order by null) from table(t2.a3)) l3
  from t2
  order by ia, ib

dbfiddle演示

  • 子查询t1创建将要比较的成对的汽车"
  • t2为每对收集相同或不同属性的集合. sys.odcivarchar2list是内置类型,仅是字符串表
  • 最终查询将集合更改为字符串列表.结果:

  • subquery t1 creates pairs of "cars" we will compare
  • t2 gathers for each pair collections of common or different attributes. sys.odcivarchar2list is built-in type, just table of string
  • final query changes collections into list of strings. Result:

IA NA            IB NB    L1        L2           L3
-- ------------ --- ----- --------- ------------ -----------
 1 Ford           2 BMW   A:B;C:D   B:C          F:G
 1 Ford           3 TESLA A:B       B:C;C:D      E:F;Z:Y
 2 BMW            3 TESLA A:B       C:D;F:G      E:F;Z:Y

我希望它可以更快,因为我们没有使用任何用户定义的函数,并且将操作次数减至最少.

I hope this to be faster, because we're not using any user defined function and number of operations is minimized.

替代方法是使用类似此功能的内容:

The alternative is to use something like this function:

-- find different or common attributes
create or replace function dca(i1 in number, i2 in number, op in char) 
  return varchar2 is 
  ret varchar2(1000);
begin 
  case op 
    when 'M' then -- minus
      select listagg(attr||':'||val, ';') within group (order by null) into ret
        from (select attr, val from t where id = i1 minus 
              select attr, val from t where id = i2 );
    when 'I' then -- intersect
      select listagg(attr||':'||val, ';') within group (order by null) into ret
        from (select attr, val from t where id = i1 intersect 
              select attr, val from t where id = i2 );
  end case;
  return ret;
end;

在此查询中:

select ia, na, ib, nb, 
       dca(ia, ib, 'I') ab, dca(ia, ib, 'M') a_b, dca(ib, ia, 'M') b_a 
  from (select distinct a.id ia, a.name na, b.id ib, b.name nb 
          from t a join t b on a.id < b.id)
  order by ia, ib;

它也可以工作,但是这是UDF,在查询中表现较差.

It works too, but this is UDF which performs worse in queries.

这篇关于通过交集合并2个数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆