Difference between SAS merge and full outer join
Problem description
Table t1:
person | visit | code_num1 | code_desc1
1 1 100 OTD
1 2 101 SED
2 3 102 CHM
3 4 103 OTD
3 4 103 OTD
4 5 101 SED
Table t2:
person | visit | code_num2 | code_desc2
1 1 104 DME
1 6 104 DME
3 4 103 OTD
3 4 103 OTD
3 7 103 OTD
4 5 104 DME
I have the following SAS code that merges the two tables t1 and t2 by person and visit:
DATA t3;
MERGE t1 t2;
BY person visit;
RUN;
Which produces the following output:
person | visit | code_num1 | code_desc1 | code_num2 | code_desc2
1        1       100         OTD          104         DME
1        2       101         SED
1        6                                104         DME
2        3       102         CHM
3        4       103         OTD          103         OTD
3        4       103         OTD          103         OTD
3        7                                103         OTD
4        5       101         SED          104         DME
I want to replicate this in a Hive query, and tried using a full outer join:
create table t3 as
select case when a.person is null then b.person else a.person end as person,
case when a.visit is null then b.visit else a.visit end as visit,
a.code_num1, a.code_desc1, b.code_num2, b.code_desc2
from t1 a
full outer join t2 b
on a.person=b.person and a.visit=b.visit
Which yields the table:
person | visit | code_num1 | code_desc1 | code_num2 | code_desc2
1 1 100 OTD 104 DME
1 2 101 SED null null
1 6 null null 104 DME
2 3 102 CHM null null
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 4 103 OTD 103 OTD
3 7 null null 103 OTD
4 5 101 SED 104 DME
This is almost the same as the SAS output, but there are 2 extra rows for (person=3, visit=4). I assume this is because Hive matches each of the two t1 rows with both of the t2 rows, producing 4 rows in t3, whereas SAS does not. Any suggestions on how I could get my query to match the output of the SAS merge?
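To make the difference concrete, here is a minimal Python sketch (not SAS or Hive code) that emulates both behaviours on the sample rows above. The helper names `by_groups`, `sas_like_merge` and `full_outer_join` are made up for illustration. The sketch assumes the usual match-merge rule that rows are paired by position within each BY group, with the last row of the shorter side retained, while a join pairs every left row with every right row that shares the key.

```python
from itertools import groupby, product

# Sample rows from the question: (person, visit, code_num, code_desc)
t1 = [(1, 1, 100, "OTD"), (1, 2, 101, "SED"), (2, 3, 102, "CHM"),
      (3, 4, 103, "OTD"), (3, 4, 103, "OTD"), (4, 5, 101, "SED")]
t2 = [(1, 1, 104, "DME"), (1, 6, 104, "DME"), (3, 4, 103, "OTD"),
      (3, 4, 103, "OTD"), (3, 7, 103, "OTD"), (4, 5, 104, "DME")]


def by_groups(rows):
    """Map each (person, visit) key to the list of its payload columns."""
    key = lambda r: r[:2]
    return {k: [r[2:] for r in g]
            for k, g in groupby(sorted(rows, key=key), key=key)}


def sas_like_merge(left, right):
    """Hypothetical helper: pair rows by position within each BY group."""
    lg, rg = by_groups(left), by_groups(right)
    out = []
    for k in sorted(lg.keys() | rg.keys()):
        ls, rs = lg.get(k, []), rg.get(k, [])
        for i in range(max(len(ls), len(rs))):
            # If one side runs out of rows, its last row is retained;
            # a side with no rows for this key contributes missing values.
            l = ls[min(i, len(ls) - 1)] if ls else (None, None)
            r = rs[min(i, len(rs) - 1)] if rs else (None, None)
            out.append(k + l + r)
    return out


def full_outer_join(left, right):
    """Hypothetical helper: pair every left row with every matching right row."""
    lg, rg = by_groups(left), by_groups(right)
    out = []
    for k in sorted(lg.keys() | rg.keys()):
        ls = lg.get(k, [(None, None)])
        rs = rg.get(k, [(None, None)])
        out.extend(k + l + r for l, r in product(ls, rs))
    return out


merged = sas_like_merge(t1, t2)
joined = full_outer_join(t1, t2)
print(len(merged), len(joined))  # prints "8 10"
```

The only key where the two results differ is (3, 4): the positional pairing yields 2 rows, the join yields 2 x 2 = 4.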
Solution: If you merge two data sets and they have variables with the same names (besides the BY variables), then variables from the second data set will overwrite any variables having the same name in the first data set. So your SAS code creates an overlaid data set. A full outer join does not do this.
It seems to me that if you first dedupe the right-side table and then do a full outer join, you should get the equivalent table in Hive. As Joe pointed out, I don't see a need for the CASE WHEN expressions either. Just do a join on the key values:
create table t3 as
select coalesce(a.person, b.person) as person
, coalesce(a.visit, b.visit) as visit
, a.code_num1
, a.code_desc1
, b.code_num2
, b.code_desc2
from
(select * from t1) a
full outer join
(select person, visit, code_num2, code_desc2
from t2
group by person, visit, code_num2, code_desc2) b
on a.person=b.person and a.visit=b.visit
;
I can't test this code currently, so be sure to test it. Good luck.
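As a rough check of the dedupe idea, the following Python sketch (again an emulation, not Hive) removes exact duplicates from t2, performs a full outer join on (person, visit), and compares the result with the SAS output listed in the question. The helpers `dedupe` and `full_outer_join` are hypothetical names, and `None` stands in for a missing value. The last few lines use made-up rows, not data from the question, to show a case where deduplication alone would not be enough.

```python
from itertools import product

# Sample rows from the question: (person, visit, code_num, code_desc)
t1 = [(1, 1, 100, "OTD"), (1, 2, 101, "SED"), (2, 3, 102, "CHM"),
      (3, 4, 103, "OTD"), (3, 4, 103, "OTD"), (4, 5, 101, "SED")]
t2 = [(1, 1, 104, "DME"), (1, 6, 104, "DME"), (3, 4, 103, "OTD"),
      (3, 4, 103, "OTD"), (3, 7, 103, "OTD"), (4, 5, 104, "DME")]

# SAS MERGE output copied from the question (None marks a missing value).
expected = [
    (1, 1, 100, "OTD", 104, "DME"),
    (1, 2, 101, "SED", None, None),
    (1, 6, None, None, 104, "DME"),
    (2, 3, 102, "CHM", None, None),
    (3, 4, 103, "OTD", 103, "OTD"),
    (3, 4, 103, "OTD", 103, "OTD"),
    (3, 7, None, None, 103, "OTD"),
    (4, 5, 101, "SED", 104, "DME"),
]


def dedupe(rows):
    """Drop exact duplicate rows, like GROUP BY on every column."""
    return sorted(set(rows))


def full_outer_join(left, right):
    """Join on (person, visit); unmatched keys get None on the missing side."""
    keys = sorted({r[:2] for r in left} | {r[:2] for r in right})
    out = []
    for k in keys:
        ls = [r[2:] for r in left if r[:2] == k] or [(None, None)]
        rs = [r[2:] for r in right if r[:2] == k] or [(None, None)]
        out.extend(k + l + r for l, r in product(ls, rs))
    return out


result = full_outer_join(t1, dedupe(t2))
print(result == expected)  # prints "True" for this sample data

# Hypothetical rows (not from the question): two DISTINCT rows per side
# for the same key survive deduplication, so the join still yields 4 rows.
t1b = [(3, 4, 103, "OTD"), (3, 4, 105, "XYZ")]
t2b = [(3, 4, 103, "OTD"), (3, 4, 106, "ABC")]
many_to_many = full_outer_join(t1b, dedupe(t2b))
print(len(many_to_many))  # prints "4"
```

So the dedupe approach reproduces the SAS output here only because the repeated t2 rows are exact duplicates; with distinct repeated keys on both sides the row counts would diverge again.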