Effective way to join tables by range using Impala


Question

I have the following tables. The first (Range) contains ranges of values and additional columns:

row  | From   |  To     | Country ....
-----|--------|---------|---------
1    | 1200   |   1500  |
2    | 2200   |   2700  |
3    | 1700   |   1900  |
4    | 2100   |   2150  |
... 

The From and To columns are bigint and are exclusive. The Range table contains 1.8M records. An additional table (Values) contains 2.7M records and looks like this:

 row     | Value  | More columns....
 --------|--------|----------------
    1    | 1777   |    
    2    | 2122   |    
    3    | 1832   |    
    4    | 1340   |    
    ... 

I would like to create one table as follows:

row      | Value  | From   | To    | More columns....
 --------|--------|--------|-------|---
    1    | 1777   | 1700   | 1900  |
    2    | 2122   | 2100   | 2150  |   
    3    | 1832   | 1700   | 1900  |   
    4    | 1340   | 1200   | 1500  |   
    ... 

I used BETWEEN for the above task, but the query never finishes:

VALUES.VALUE between RANGE.FROM and RANGE.TO

Do I need to change something in the table partitions or in Impala?

Recommended answer

The main idea of the following solution is to replace a theta join (non-equi join) with an equi join, which leads to good data distribution and an efficient local join algorithm.

The range (-infinity, infinity) is split into sections of length n.
Each range from the ranges table is associated with the sections it intersects.

E.g. given n=1000, the range [1652,3701] is associated with the sections [1000,2000), [2000,3000) and [3000,4000) (and yields 3 records, one per section).

               1652              3701
               |                 |
               -------------------

-------------------------------------------------------
|        |        |        |        |        |                
0        1000     2000     3000     4000     5000 
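The mapping from a range to its sections can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the original answer; the function name `sections_for_range` is hypothetical, and the range is treated as closed on both ends, as in the example above.

```python
from typing import List, Tuple


def sections_for_range(from_val: int, to_val: int, n: int) -> List[Tuple[int, int]]:
    """Return every half-open section [k*n, (k+1)*n) that the closed
    range [from_val, to_val] intersects (hypothetical helper)."""
    first = from_val // n  # index of the section containing from_val
    last = to_val // n     # index of the section containing to_val
    return [(k * n, (k + 1) * n) for k in range(first, last + 1)]


print(sections_for_range(1652, 3701, 1000))
# [(1000, 2000), (2000, 3000), (3000, 4000)]
```

Python's `//` is floor division, so it plays the same role here as `floor(x/n)` does in a SQL expression.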

In the same manner, a value from the values table is associated with the section that contains it, e.g. 2093 is associated with the section [2000,3000).

The join between the 2 tables is on the value that represents the section, e.g. [1652,3701] and 2093 are joined on the section [2000,3000). The original BETWEEN condition is then applied as a filter on the matched rows.
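To make the idea concrete outside of SQL, here is a minimal in-memory sketch of the same technique in Python, using the sample rows from the question. The function name `range_join` and the dictionary-based build/probe structure are assumptions made for illustration; a distributed engine would do the equivalent with a hash join on the section index.

```python
from collections import defaultdict

N = 1000  # section length; a tuning value, here the same n=1000 as in the example

ranges = [(1, 1200, 1500), (2, 2200, 2700), (3, 1700, 1900), (4, 2100, 2150)]
values = [(1, 1777), (2, 2122), (3, 1832), (4, 1340)]


def range_join(ranges, values, n):
    # Build side: register each range under every section index it intersects.
    by_section = defaultdict(list)
    for _range_id, from_val, to_val in ranges:
        for section in range(from_val // n, to_val // n + 1):
            by_section[section].append((from_val, to_val))

    # Probe side: look up the single section of each value (the equi join),
    # then apply the original BETWEEN predicate to discard false candidates.
    result = []
    for value_id, val in values:
        for from_val, to_val in by_section.get(val // n, []):
            if from_val <= val <= to_val:
                result.append((value_id, val, from_val, to_val))
    return sorted(result)


joined = range_join(ranges, values, N)
for row in joined:
    print(row)
```

Because a value belongs to exactly one section and a range has exactly one entry per section, each matching (value, range) pair is produced once, so no de-duplication step is needed.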

create table val_range (id int,from_val bigint,to_val bigint);

insert into val_range values
    (1,1200,1500)
   ,(2,2200,2700)
   ,(3,1700,1900)
   ,(4,2100,2150)
;   

create table val (id int,val bigint);

insert into val values
    (1,1777)    
   ,(2,2122)    
   ,(3,1832)    
   ,(4,1340)
;   







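-- Hive syntax. n is the section length (a tuning parameter).
-- posexplode(split(space(k),' ')) generates the offsets 0..k, so each range
-- yields one row per section it intersects; match_val is that section's index.
set n=1000;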

select      v.id
           ,v.val
           ,r.from_val
           ,r.to_val

from       (select  r.*
                   ,floor(from_val/${hiveconf:n}) + pe.i    as match_val

            from    val_range r
                    lateral view    posexplode
                                    (
                                        split
                                        (
                                            space
                                            (
                                                cast
                                                (
                                                    floor(to_val/${hiveconf:n}) 
                                                  - floor(from_val/${hiveconf:n}) 

                                                    as int
                                                )
                                            )
                                           ,' '
                                        )
                                    ) pe as i,x
            ) r

            join    val v

            on      floor(v.val/${hiveconf:n})    =
                    r.match_val

where       v.val between r.from_val and r.to_val

order by    v.id        
;







+------+-------+------------+----------+
| v.id | v.val | r.from_val | r.to_val |
+------+-------+------------+----------+
|    1 |  1777 |       1700 |     1900 |
|    2 |  2122 |       2100 |     2150 |
|    3 |  1832 |       1700 |     1900 |
|    4 |  1340 |       1200 |     1500 |
+------+-------+------------+----------+
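On an engine where `lateral view posexplode` is unavailable, one possible way to expand each range into its sections is to join against a small table of offsets. The sketch below tries that variant with Python's built-in `sqlite3` module so that it can be run anywhere. The `section_offset` table is a hypothetical helper that is not part of the original answer, and the SQL has only been written against SQLite, so it may need adjusting for other engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table val_range (id int, from_val bigint, to_val bigint);
    insert into val_range values
        (1,1200,1500),(2,2200,2700),(3,1700,1900),(4,2100,2150);

    create table val (id int, val bigint);
    insert into val values (1,1777),(2,2122),(3,1832),(4,1340);

    -- Hypothetical helper table: offsets 0..k, where k must be at least the
    -- largest number of extra sections that any single range spans.
    create table section_offset (i int);
    insert into section_offset values (0),(1),(2),(3);
""")

n = 1000
# SQLite's integer "/" truncates, which equals floor() for non-negative values;
# an engine whose "/" returns a double would need floor(x / n) instead.
query = """
    select v.id, v.val, rs.from_val, rs.to_val
    from (
        select r.from_val, r.to_val, r.from_val / :n + o.i as match_val
        from val_range r
        join section_offset o
          on o.i <= r.to_val / :n - r.from_val / :n
    ) rs
    join val v
      on v.val / :n = rs.match_val
    where v.val between rs.from_val and rs.to_val
    order by v.id
"""
rows = conn.execute(query, {"n": n}).fetchall()
for row in rows:
    print(row)
```

The join against `section_offset` is still a non-equi join, but it runs against a very small table; the large join between ranges and values remains an equi join on the section index.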
