Finding efficiently all relevant sub ranges for bigdata tables in Hive/Spark
Problem Description
Following this question, I would like to ask.
I have two tables:
The first table - MajorRange
row | From | To | Group ....
-----|--------|---------|---------
1 | 1200 | 1500 | A
2 | 2200 | 2700 | B
3 | 1700 | 1900 | C
4 | 2100 | 2150 | D
...
The second table - SubRange
row | From | To | Group ....
-----|--------|---------|---------
1 | 1208 | 1300 | E
2 | 1400 | 1600 | F
3 | 1700 | 2100 | G
4 | 2100 | 2500 | H
...
The output table should contain all the SubRange groups that overlap with a MajorRange group. In the following example the result table is:
row | Major | Sub |
-----|--------|------|-
1 | A | E |
2 | A | F |
3 | B | H |
4 | C | G |
5 | D | H |
In case there is no overlap between the ranges, the Major group will not appear.
Both tables are big data tables. How can I do this in Hive/Spark in the most efficient way?
Recommended Answer
With Spark, maybe a non-equi join like this?
import org.apache.spark.sql.functions.monotonically_increasing_id

// Two ranges overlap when each one starts before the other ends
val join_expr = major_range("From") < sub_range("To") && major_range("To") > sub_range("From")

(major_range.join(sub_range, join_expr)
  .select(
    monotonically_increasing_id().as("row"),
    major_range("Group").as("Major"),
    sub_range("Group").as("Sub")
  )
).show
+---+-----+---+
|row|Major|Sub|
+---+-----+---+
| 0| A| E|
| 1| A| F|
| 2| B| H|
| 3| C| G|
| 4| D| H|
+---+-----+---+
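The join condition itself is just the standard interval-overlap predicate: two ranges overlap when each starts before the other ends. A minimal plain-Python sketch (using the sample data from the question, with hypothetical variable names) shows the same pairs the Spark join produces:

```python
# Sample data from the tables above: (group, from, to)
major = [("A", 1200, 1500), ("B", 2200, 2700), ("C", 1700, 1900), ("D", 2100, 2150)]
sub = [("E", 1208, 1300), ("F", 1400, 1600), ("G", 1700, 2100), ("H", 2100, 2500)]

def overlaps(m_from, m_to, s_from, s_to):
    # Same predicate as join_expr: From < other To AND To > other From
    return m_from < s_to and m_to > s_from

# Naive cross-product check; Spark distributes this same predicate at scale
pairs = [(mg, sg)
         for mg, mf, mt in major
         for sg, sf, st in sub
         if overlaps(mf, mt, sf, st)]
print(pairs)  # [('A', 'E'), ('A', 'F'), ('B', 'H'), ('C', 'G'), ('D', 'H')]
```

Note that a non-equi join like this cannot use a hash-partitioned shuffle on the range columns, so Spark will typically fall back to a broadcast nested-loop join; if one table is small enough to broadcast, that is usually the efficient path.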