Difference between SAS merge and full outer join


Problem description

Table t1:

person | visit | code_num1 | code_desc1
     1       1         100         OTD
     1       2         101         SED
     2       3         102         CHM
     3       4         103         OTD 
     3       4         103         OTD
     4       5         101         SED

Table t2:

 person | visit | code_num2 | code_desc2
     1       1         104         DME
     1       6         104         DME
     3       4         103         OTD 
     3       4         103         OTD
     3       7         103         OTD
     4       5         104         DME

I have the following SAS code that merges the two tables t1 and t2 by person and visit:

DATA t3;
    MERGE t1 t2;
    BY person visit;
RUN;

Which produces the following output:

person | visit | code_num1 | code_desc1 |code_num2 | code_desc2
      1       1         100         OTD        104          DME
      1       2         101         SED   
      1       6                                104          DME           
      2       3         102         CHM 
      3       4         103         OTD        103          OTD
      3       4         103         OTD        103          OTD
      3       7                                103          OTD
      4       5         101         SED        104          DME

I want to replicate this in a Hive query, and tried using a full outer join:

create table t3 as 
select case when a.person is null then b.person else a.person end as person,
       case when a.visit is null then b.visit else a.visit end as visit,
       a.code_num1, a.code_desc1, b.code_num2, b.code_desc2
       from t1 a 
       full outer join t2 b
       on a.person=b.person and a.visit=b.visit

Which yields the table:

person | visit | code_num1 | code_desc1 |code_num2 | code_desc2
      1       1         100         OTD        104          DME
      1       2         101         SED        null        null
      1       6         null        null       104          DME           
      2       3         102         CHM        null        null
      3       4         103         OTD        103          OTD
      3       4         103         OTD        103          OTD
      3       4         103         OTD        103          OTD
      3       4         103         OTD        103          OTD
      3       7         null        null       103          OTD
      4       5         101         SED        104          DME

This is almost the same as the SAS output, but we have 2 extra rows for (person=3, visit=4). I assume this is because Hive matches each of the two rows in one table with both rows in the other, producing 4 rows in t3, whereas SAS does not. Any suggestions on how I could get my query to match the output of the SAS merge?

Solution

If you merge two data sets that have variables with the same names (besides the BY variables), the variables from the second data set will overwrite any variables with the same name in the first data set. So your SAS code creates an overlaid data set. A full outer join does not do this.
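
As a quick check of where the join and the merge diverge, one could count rows per key in each table. This is only a sketch; the aliases n1, n2 and joined_rows are made up for illustration. Any key that appears more than once in both tables is multiplied by a join (n1 * n2 rows), which is what happens to (person=3, visit=4) in the sample data: 2 rows on each side, so 4 joined rows.

```sql
-- Sketch: list the keys that a join on (person, visit) will multiply.
-- n1, n2 and joined_rows are illustrative names, not part of the original tables.
select  a.person
      , a.visit
      , a.n1
      , b.n2
      , a.n1 * b.n2 as joined_rows   -- rows the join emits for this key
   from
   (select person, visit, count(*) as n1 from t1 group by person, visit) a
   join
   (select person, visit, count(*) as n2 from t2 group by person, visit) b
   on a.person = b.person and a.visit = b.visit
   where a.n1 > 1 and b.n2 > 1
   ;
```

If this query returns no rows, the full outer join and the SAS merge should produce the same number of rows.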

It seems to me that if you first dedupe the right-side table and then do a full outer join, you should get the equivalent table in Hive. As Joe pointed out, I don't see a need for the CASE WHEN expressions either. Just do a join on the key values:

create table t3 as 
select  coalesce(a.person, b.person) as person
      , coalesce(a.visit, b.visit) as visit
      , a.code_num1
      , a.code_desc1
      , b.code_num2
      , b.code_desc2
   from 
   (select * from t1) a 
   full outer join
   (select person, visit, code_num2, code_desc2
       from t2
       group by person, visit, code_num2, code_desc2) b
   on a.person=b.person and a.visit=b.visit
   ;

I can't test this code currently, so be sure to test it. Good luck.
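
If duplicate rows on the right side have to be kept rather than collapsed, deduping will not work. A different approach, not the one in the answer above, is to number the rows within each (person, visit) group and join on that number as well, so rows are paired by position the way the merge pairs them. The sketch below is untested and rests on assumptions: rn is a made-up alias, and the ORDER BY columns are a guess, since SAS pairs rows in their stored order while Hive tables have no inherent order.

```sql
-- Sketch: pair rows positionally within each (person, visit) group.
-- rn is an illustrative alias; the ORDER BY columns are an assumption.
create table t3 as
select  coalesce(a.person, b.person) as person
      , coalesce(a.visit, b.visit)   as visit
      , a.code_num1
      , a.code_desc1
      , b.code_num2
      , b.code_desc2
   from
   (select t1.*
         , row_number() over (partition by person, visit order by code_num1) as rn
      from t1) a
   full outer join
   (select t2.*
         , row_number() over (partition by person, visit order by code_num2) as rn
      from t2) b
   on a.person = b.person and a.visit = b.visit and a.rn = b.rn
   ;
```

For the sample data this should give two rows for (person=3, visit=4) instead of four. One caveat: when a group has more rows in one table than in the other, SAS carries the last values from the shorter side forward, whereas this query leaves those columns null.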
