Effective way to join tables by range using Impala
Question
I have the following tables. The first (Range) includes ranges of values and additional columns:
row | From | To | Country ....
-----|--------|---------|---------
1 | 1200 | 1500 |
2 | 2200 | 2700 |
3 | 1700 | 1900 |
4 | 2100 | 2150 |
...
From and To are bigint and are exclusive. The Range table includes 1.8M records.
An additional table (Values) contains 2.7M records and looks like:
row | Value | More columns....
--------|--------|----------------
1 | 1777 |
2 | 2122 |
3 | 1832 |
4 | 1340 |
...
I would like to create one table as follows:
row | Value | From | To | More columns....
--------|--------|--------|-------|---
1 | 1777 | 1700 | 1900 |
2 | 2122 | 2100 | 2150 |
3 | 1832 | 1700 | 1900 |
4 | 1340 | 1200 | 1500 |
...
I used BETWEEN for the above task, but the query never finishes:
VALUES.VALUE between RANGE.FROM and RANGE.TO
Is there a change I need to make to the table partitions or in Impala?
Recommended answer
The main idea of the following solution is to replace the theta join (non-equi join) with an equi join, which leads to good data distribution and an efficient local join algorithm.
The range (-infinity, infinity) is split into sections of length n.
Each range from the ranges table is associated with every section it intersects.
For example, given n=1000, the range [1652,3701] is associated with the sections [1000,2000), [2000,3000) and [3000,4000), so it produces 3 records, 1 for each section.
                1652                 3701
                |                    |
                ----------------------
---------------------------------------------------
|         |         |         |         |         |
0         1000      2000      3000      4000      5000
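The mapping from a range to its sections can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the original answer; the function name `sections_for_range` is hypothetical, and sections are identified here by their lower bound.

```python
def sections_for_range(from_val: int, to_val: int, n: int) -> list[int]:
    """Return the lower bound of every section [k*n, (k+1)*n)
    that the range [from_val, to_val] intersects."""
    first = from_val // n  # index of the section containing from_val
    last = to_val // n     # index of the section containing to_val
    return [k * n for k in range(first, last + 1)]


# The range from the example above, with n = 1000
print(sections_for_range(1652, 3701, 1000))  # [1000, 2000, 3000]
```

A range that fits inside a single section produces exactly one record, so n is best chosen with the typical range length in mind.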
In the same manner, a value from the values table is associated with the section that contains it, e.g. 2093 is associated with the section [2000,3000).
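For a single value the computation is one floor division. A minimal sketch, with the hypothetical helper name `section_of`, again identifying a section by its lower bound:

```python
def section_of(value: int, n: int) -> int:
    """Return the lower bound of the section [k*n, (k+1)*n) that contains value."""
    return (value // n) * n  # floor division also handles negative values


print(section_of(2093, 1000))  # 2000
```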
The join between the 2 tables is then performed on the value that represents the section, e.g. [1652,3701] and 2093 are joined on the section [2000,3000).
create table val_range (id int,from_val bigint,to_val bigint);
insert into val_range values
(1,1200,1500)
,(2,2200,2700)
,(3,1700,1900)
,(4,2100,2150)
;
create table val (id int,val bigint);
insert into val values
(1,1777)
,(2,2122)
,(3,1832)
,(4,1340)
;
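-- n is the section width. The query uses Hive syntax
-- (hiveconf variables, lateral view posexplode).
-- space(k) returns k spaces; splitting it on ' ' gives k+1 elements,
-- so posexplode yields i = 0..k, one row per section the range intersects.
-- The equi join is on the section number; BETWEEN then filters within the section.
set n=1000;
select  v.id
       ,v.val
       ,r.from_val
       ,r.to_val
from   (select  r.*
               ,floor(from_val/${hiveconf:n}) + pe.i as match_val
        from    val_range r
                lateral view posexplode
                (
                    split
                    (
                        space
                        (
                            cast
                            (
                                floor(to_val/${hiveconf:n})
                              - floor(from_val/${hiveconf:n})
                                as int
                            )
                        )
                       ,' '
                    )
                ) pe as i,x
       ) r
       join val v
       on   floor(v.val/${hiveconf:n}) = r.match_val
where   v.val between r.from_val and r.to_val
order by v.id
;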
+------+-------+------------+----------+
| v.id | v.val | r.from_val | r.to_val |
+------+-------+------------+----------+
| 1 | 1777 | 1700 | 1900 |
| 2 | 2122 | 2100 | 2150 |
| 3 | 1832 | 1700 | 1900 |
| 4 | 1340 | 1200 | 1500 |
+------+-------+------------+----------+
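To see why the equi join helps, the same logic can be sketched in plain Python. This is an illustrative model of the approach, not what Hive or Impala actually execute; the function name `range_join` is hypothetical. Ranges are hashed into buckets by section number, and each value is compared only against the ranges in its own bucket instead of against every range.

```python
from collections import defaultdict


def range_join(ranges, values, n):
    """ranges: iterable of (id, from_val, to_val); values: iterable of (id, val).
    Returns (value_id, val, from_val, to_val) for every val inside a range."""
    # Build side: one bucket entry per (range, section it intersects)
    buckets = defaultdict(list)
    for rng in ranges:
        _, from_val, to_val = rng
        for key in range(from_val // n, to_val // n + 1):
            buckets[key].append(rng)

    # Probe side: equi-match on the section number, then apply the range filter
    result = []
    for value_id, val in values:
        for _, from_val, to_val in buckets.get(val // n, []):
            if from_val <= val <= to_val:
                result.append((value_id, val, from_val, to_val))
    return result


ranges = [(1, 1200, 1500), (2, 2200, 2700), (3, 1700, 1900), (4, 2100, 2150)]
values = [(1, 1777), (2, 2122), (3, 1832), (4, 1340)]
for row in range_join(ranges, values, 1000):
    print(row)
```

On the sample data this prints the same four rows as the query output above.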