左外部连接结果在蜂巢上变得更大 [英] left outer join results become bigger on hive

查看:97
本文介绍了左外部连接结果在蜂巢上变得更大的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

全部,我想从此查询中获得可靠的结果:

All, I want to get reliable results from this query:

SELECT ..

FROM
(
SELECT 
 CO_CODE,
REP.cua cua,
PRD.PRODUCT_DESC, 
REGEXP_EXTRACT(B.rfbbn,'^(?:[^*]*\\*){2}([^*]*)',1) cllt,
NVL(CCY_bbce,0) bbce, 
B.TYPE, 
A.conn_keyy

FROM 
(
SELECT conn_keyy , ext_date FROM 
(tablee.aa) A
)aaxyz 
WHERE flag = 'Y'
)A

LEFT OUTER JOIN 
tablee.B
ON A.conn_keyy = B.conn_keyy

LEFT OUTER JOIN (SELECT DISTINCT * FROM tablee.cc) CPLCUR
ON CPLCUR.conn_keyy = A.conn_keyy
AND CPLCUR.cllt = REGEXP_EXTRACT(B.rfbbn,'^(?:[^*]*\\*){2}([^*]*)',1)
AND CPLCUR.dtdt = '1999' 

LEFT OUTER JOIN (SELECT DISTINCT * FROM tablee.dd) CPLBAL
ON CPLBAL.conn_keyy = A.conn_keyy
AND CPLBAL.SEQUENCE = CPLCUR.SEQUENCE
AND CPLBAL.dtdt = '1999' 

LEFT OUTER JOIN (SELECT DISTINCT * FROM tablee.ee) CPLCCY
ON CPLCCY.conn_keyy = A.conn_keyy
AND CPLCCY.SEQUENCE = CPLCUR.SEQUENCE
AND CPLCCY.dtdt = '1999' 

LEFT OUTER JOIN (SELECT DISTINCT * FROM tablee.ff) CPLMOV
ON CPLMOV.conn_keyy = A.conn_keyy
AND CPLMOV.SEQUENCE = CPLCUR.SEQUENCE
AND CPLMOV.dtdt = '1999' 

LEFT OUTER JOIN
 (tablee.REP)REP 
ON REP.relino = B.lnido

LEFT OUTER JOIN tablee.P PRD
ON PRD.PRODUCT_CODE = REGEXP_EXTRACT(A.conn_keyy,'[.]([^.]+)',1)
AND PRD.dtdt = '1999'

WHERE B.lnido LIKE 'PLCONS1%'
) rrvv;

fyi,A的选择计数(*)约为60,000

fyi, the select count (*) of A is around 60,000

我只是想知道为什么我的查询结果变成15亿..我错过了什么?我操作左外联接时出了什么问题?

I just wondering why my query results turns 1.5 billions.. What did I miss? What went wrong when i operate the left-outer-join?

推荐答案

如果第二个表中的联接键不是唯一的,并且两个表中的联接键都不是唯一的,则联接可以重复行将产生更多的重复项.

Join can duplicate rows if the join key is not unique in second table and if join key is not unique in both tables, it will produce much more duplicates.

例如:

with 

A as (
select 1 key, 'one' name
union all
select 1 key, 'two' name
),

B as (
select 1 key, 'one' name
union all
select 1 key, 'two' name
)

select *
  from A left join B on A.key=B.key

结果为四行,每个表仅包含两行:

Result in four rows and each table contain only two:

a.key   a.name  b.key   b.name
1       one     1       one
1       one     1       two
1       two     1       one
1       two     1       two

如何查找重复的密钥:

select B.conn_keyy, count(*) cnt 
  from tablee.B
group by B.conn_keyy
 having count(*)>1
order by cnt desc limit 100;

检查要与之连接的每个表并确定可以执行的操作:应用过滤,不同或添加更多的联接键以进行联接(一对一或零)或(许多一或零)

Check every table you are joining with and decide what you can do: apply filtering, distinct or add more join keys to make join (one to one or zero) or (many to one or zero)

这篇关于左外部连接结果在蜂巢上变得更大的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆