Processing Hive Lookup tables in Spark vs Spark Broadcast variables

Problem description
I have two data sets named dataset1 and dataset2, and dataset1 is like:
empid empname
101 john
102 kevin
and dataset2 is like:
empid empmarks empaddress
101 75 LA
102 69 NY
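For experimenting with the answers below, the two sample tables can be built as DataFrames like this (a minimal sketch; the column names are taken from the tables above, and a local SparkSession is assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("lookup-join")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Small lookup table: empid -> empname
val dataset1 = Seq((101, "john"), (102, "kevin"))
  .toDF("empid", "empname")

// Large table: empid, empmarks, empaddress
val dataset2 = Seq((101, 75, "LA"), (102, 69, "NY"))
  .toDF("empid", "empmarks", "empaddress")
```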
dataset2 will be very huge. I need to perform some operations on these two datasets and get results from both of them.

As far as I know, I now have two options to process these datasets:
1. Store dataset1 (which is smaller) as a Hive lookup table and process it through Spark.
2. Use Spark broadcast variables to process these datasets.
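For reference, option 2 with a raw broadcast variable usually means collecting the small dataset to the driver and broadcasting it as a map (a sketch; `dataset1` is assumed to be a DataFrame with `empid` and `empname` columns, and `spark` an active SparkSession):

```scala
// Collect the small lookup table to the driver and broadcast it,
// so every executor receives one read-only copy of the map.
val lookupMap: Map[Int, String] = dataset1.collect()
  .map(r => r.getAs[Int]("empid") -> r.getAs[String]("empname"))
  .toMap
val lookup = spark.sparkContext.broadcast(lookupMap)

// Executors can then resolve names without a shuffle:
// lookup.value.get(empid)
```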
Can anyone suggest which one is the better option?
Answer
A plain join should be a better option than the two mentioned above. Since the two datasets share a common key (empid), you can do an inner join:
dataset2.join(dataset1, Seq("empid"), "inner").show()
You can also use the broadcast function/hint like this, which tells the framework that the small DataFrame, i.e. dataset1, should be broadcast to every executor:
import org.apache.spark.sql.functions.broadcast
dataset2.join(broadcast(dataset1), Seq("empid"), "inner").show()
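Note that Spark can also pick a broadcast hash join automatically when one side's estimated size falls below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), so the explicit hint is mainly useful when the size estimate is off. A sketch of tuning that threshold (the 50 MB value is an illustrative choice, not a recommendation):

```scala
// Raise the automatic broadcast threshold to 50 MB;
// setting it to -1 disables automatic broadcasting entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)
```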
Also look at these for more details:

- DataFrame join optimization - Broadcast Hash Join (how broadcast joins work)
- What is the maximum size for a broadcast object in Spark