使用范围连接减少记录数量 [英] Reduction in the number of records using range join
问题描述
继我的问题
我有以下表格第一个(范围)包括值范围和附加列:
row |从|为|国家......
----- | -------- | --------- | ---------
1 | 1200 | 1500 |
2 | 2200 | 2700 |
3 | 1700 | 1900 |
4 | 2100 | 2150 |
...
From
和至
都是bigint,并且是独占的。范围表包括1.8M记录。附加表(值)包含2.7M记录,如下所示:
row |值|更多专栏....
-------- | -------- | ----------------
1 | 1777 |
2 | 2122 |
3 | 1832 |
4 | 1340 |
...
我想创建一个表格,如下所示:
row |值|从|为|更多专栏....
-------- | -------- | -------- | ------- | ---
1 | 1777 | 1700 | 1900 |
2 | 2122 | 2100 | 2150 |
3 | 1832 | 1700 | 1900 |
4 | 1340 | 1200 | 1500 |
...
我在以下代码中使用了左外连接:
set n = 1000;
从val v $ b $中选择v.id
,v.val
,r.from_val
,r.to_val
b离开外部连接
(选择r。*
,floor(from_val / $ {hiveconf:n})+ pe.i as match_val
from val_range r
横向视图posexplode
(
split
(
space
(
cast
(
floor(to_val / $ {hiveconf:n})
- floor(from_val / $ {hiveconf:n})
as int
)
)
,''
)
)pe as i,x
)r
(v.val / $ {hiveconf:n})=
r.match_val
r.from_val和r.to_val之间的v.val
order by v.id
;
然而,新表记录的数量大大减少M.如果我使用左外连接
,怎么会这样呢?假如我们有一个 v.id
set n = 1000;
选择v.id
,r.from_val
,r.to_val
从val v
left join选择v.id
,r.from_val
,r.to_val
从val v
加入(...)r
(v.val / $ {hiveconf:n})=
r.match_val
r.from_val和r.to_val之间的v.val
)r
on r.id =
v.id
by v.id
至于OP请求,以下是完整的查询:
set n = 1000;
选择v.id
,r.from_val
,r.to_val
从val v
left join选择v.id
,r.from_val
,r.to_val
从val v
加入(选择r。*
, (from_val / $ {hiveconf:n})+ pe.i as match_val
from val_range r
横向视图posexplode
(
split
空间
(
cast
(
floor(to_val / $ {hiveconf:n})
- floor(from_val / $ {hiveconf:n})
作为整数
)
)
,'
)
)pe作为i,x
)r
(v.val / $ {hiveconf:n})=
r.match_val
r.from_val和r.to_val之间的v.val
)r
r.id =
v.id
order by v。 id
Following the my question I have the following tables the first (Range) includes range of values and additional columns:
row | From | To | Country ....
-----|--------|---------|---------
1 | 1200 | 1500 |
2 | 2200 | 2700 |
3 | 1700 | 1900 |
4 | 2100 | 2150 |
...
The From
and To
are bigint and are exclusive. The Range table includes 1.8M records. Additional table (Values) contains 2.7M records and looks like:
row | Value | More columns....
--------|--------|----------------
1 | 1777 |
2 | 2122 |
3 | 1832 |
4 | 1340 |
...
I would like to create one table as followed:
row | Value | From | To | More columns....
--------|--------|--------|-------|---
1 | 1777 | 1700 | 1900 |
2 | 2122 | 2100 | 2150 |
3 | 1832 | 1700 | 1900 |
4 | 1340 | 1200 | 1500 |
...
I used the left outer join in the following code:
set n=1000;
select v.id
,v.val
,r.from_val
,r.to_val
from val v
left outer join
(select r.*
,floor(from_val/${hiveconf:n}) + pe.i as match_val
from val_range r
lateral view posexplode
(
split
(
space
(
cast
(
floor(to_val/${hiveconf:n})
- floor(from_val/${hiveconf:n})
as int
)
)
,' '
)
) pe as i,x
) r
on floor(v.val/${hiveconf:n}) =
r.match_val
where v.val between r.from_val and r.to_val
order by v.id
;
However there is a vast reduction in the number of records of the new table ~31k records out of 2.7M. How can it be if I use the left outer join
? How can I fix it?
Assuming we have a v.id
set n=1000;
select v.id
,r.from_val
,r.to_val
from val v
left join (select v.id
,r.from_val
,r.to_val
from val v
join (...) r
on floor(v.val/${hiveconf:n}) =
r.match_val
where v.val between r.from_val and r.to_val
) r
on r.id =
v.id
order by v.id
As for the OP request, here is the full query:
set n=1000;
select v.id
,r.from_val
,r.to_val
from val v
left join (select v.id
,r.from_val
,r.to_val
from val v
join (select r.*
,floor(from_val/${hiveconf:n}) + pe.i as match_val
from val_range r
lateral view posexplode
(
split
(
space
(
cast
(
floor(to_val/${hiveconf:n})
- floor(from_val/${hiveconf:n})
as int
)
)
,' '
)
) pe as i,x
) r
on floor(v.val/${hiveconf:n}) =
r.match_val
where v.val between r.from_val and r.to_val
) r
on r.id =
v.id
order by v.id
这篇关于使用范围连接减少记录数量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!