使用范围连接减少记录数量 [英] Reduction in the number of records using range join

查看:255
本文介绍了使用范围连接减少记录数量的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的问题
我有以下表格第一个(范围)包括值范围和附加列:

  row |从|为|国家...... 
----- | -------- | --------- | ---------
1 | 1200 | 1500 |
2 | 2200 | 2700 |
3 | 1700 | 1900 |
4 | 2100 | 2150 |
...

From 都是bigint,并且是独占的。范围表包括1.8M记录。附加表(值)包含2.7M记录,如下所示:

  row |值|更多专栏.... 
-------- | -------- | ----------------
1 | 1777 |
2 | 2122 |
3 | 1832 |
4 | 1340 |
...

我想创建一个表格,如下所示:

  row |值|从|为|更多专栏.... 
-------- | -------- | -------- | ------- | ---
1 | 1777 | 1700 | 1900 |
2 | 2122 | 2100 | 2150 |
3 | 1832 | 1700 | 1900 |
4 | 1340 | 1200 | 1500 |
...

我在以下代码中使用了左外连接:

  set n = 1000; 

从val v $ b $中选择v.id
,v.val
,r.from_val
,r.to_val

b离开外部连接

(选择r。*
,floor(from_val / $ {hiveconf:n})+ pe.i as match_val

from val_range r
横向视图posexplode

split

space

cast

floor(to_val / $ {hiveconf:n})
- floor(from_val / $ {hiveconf:n})

as int


,''

)pe as i,x
)r



(v.val / $ {hiveconf:n})=
r.match_val

r.from_val和r.to_val之间的v.val

order by v.id
;

然而,新表记录的数量大大减少M.如果我使用左外连接,怎么会这样呢?假如我们有一个 v.id



  set n = 1000; 

选择v.id
,r.from_val
,r.to_val

从val v

left join选择v.id
,r.from_val
,r.to_val

从val v

加入(...)r
(v.val / $ {hiveconf:n})=
r.match_val

r.from_val和r.to_val之间的v.val
)r

on r.id =
v.id

by v.id

至于OP请求,以下是完整的查询:

  set n = 1000; 

选择v.id
,r.from_val
,r.to_val

从val v

left join选择v.id
,r.from_val
,r.to_val

从val v

加入(选择r。*
, (from_val / $ {hiveconf:n})+ pe.i as match_val

from val_range r
横向视图posexplode

split

空间

cast

floor(to_val / $ {hiveconf:n})
- floor(from_val / $ {hiveconf:n})

作为整数


,'

)pe作为i,x
)r

(v.val / $ {hiveconf:n})=
r.match_val

r.from_val和r.to_val之间的v.val
)r

r.id =
v.id

order by v。 id


Following the my question I have the following tables the first (Range) includes range of values and additional columns:

row  | From   |  To     | Country ....
-----|--------|---------|---------
1    | 1200   |   1500  |
2    | 2200   |   2700  |
3    | 1700   |   1900  |
4    | 2100   |   2150  |
... 

The From and To are bigint and are exclusive. The Range table includes 1.8M records. Additional table (Values) contains 2.7M records and looks like:

 row     | Value  | More columns....
 --------|--------|----------------
    1    | 1777   |    
    2    | 2122   |    
    3    | 1832   |    
    4    | 1340   |    
    ... 

I would like to create one table as followed:

row      | Value  | From   | To    | More columns....
 --------|--------|--------|-------|---
    1    | 1777   | 1700   | 1900  |
    2    | 2122   | 2100   | 2150  |   
    3    | 1832   | 1700   | 1900  |   
    4    | 1340   | 1200   | 1500  |   
    ... 

I used the left outer join in the following code:

set n=1000;

select      v.id
           ,v.val
           ,r.from_val
           ,r.to_val

from      val v
        left outer join    

 (select  r.*
                   ,floor(from_val/${hiveconf:n}) + pe.i    as match_val

            from    val_range r
                    lateral view    posexplode
                                    (
                                        split
                                        (
                                            space
                                            (
                                                cast
                                                (
                                                    floor(to_val/${hiveconf:n}) 
                                                  - floor(from_val/${hiveconf:n}) 

                                                    as int
                                                )
                                            )
                                           ,' '
                                        )
                                    ) pe as i,x
            ) r



            on      floor(v.val/${hiveconf:n})    =
                    r.match_val

where       v.val between r.from_val and r.to_val

order by    v.id       
;

However there is a vast reduction in the number of records of the new table ~31k records out of 2.7M. How can it be if I use the left outer join? How can I fix it?

解决方案

Assuming we have a v.id

set n=1000;

select      v.id
           ,r.from_val
           ,r.to_val

from                    val     v 

            left join  (select      v.id
                                   ,r.from_val
                                   ,r.to_val

                        from                val     v 

                                    join    (...)   r 

                                    on      floor(v.val/${hiveconf:n})    =
                                            r.match_val

                        where       v.val between r.from_val and r.to_val
                        ) r

            on          r.id    =
                        v.id

order by    v.id       

As for the OP request, here is the full query:

set n=1000;

select      v.id
           ,r.from_val
           ,r.to_val

from                    val     v 

            left join  (select      v.id
                                   ,r.from_val
                                   ,r.to_val

                        from                val     v 

                                    join   (select  r.*
                                                   ,floor(from_val/${hiveconf:n}) + pe.i    as match_val

                                            from    val_range r
                                                    lateral view    posexplode
                                                                    (
                                                                        split
                                                                        (
                                                                            space
                                                                            (
                                                                                cast
                                                                                (
                                                                                    floor(to_val/${hiveconf:n}) 
                                                                                  - floor(from_val/${hiveconf:n}) 

                                                                                    as int
                                                                                )
                                                                            )
                                                                           ,' '
                                                                        )
                                                                    ) pe as i,x
                                            ) r

                                    on      floor(v.val/${hiveconf:n})    =
                                            r.match_val

                        where       v.val between r.from_val and r.to_val
                        ) r

            on          r.id    =
                        v.id

order by    v.id       

这篇关于使用范围连接减少记录数量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆