Processing Hive Lookup tables in Spark vs Spark Broadcast variables
Question
I have two datasets named dataset1 and dataset2, and dataset1 is like:

empid  empname
101    john
102    kevin
and dataset2 is like:

empid  empmarks  empaddress
101    75        LA
102    69        NY
dataset2 will be very huge. I need to perform some operations on these two datasets and get results from both of them. As far as I know, I now have two options to process these datasets:
1. Store dataset1 (which is smaller in size) as a Hive lookup table and process them through Spark.
2. Process these datasets by using Spark Broadcast variables.
Could anyone please suggest which one is the better option?
Answer
There is a better option than the two you mentioned: since the two datasets share a common key (empid), you can do an inner join directly.
dataset2.join(dataset1, Seq("empid"), "inner").show()
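For intuition, here is what that inner join on empid produces for the sample rows above, sketched in plain Python (a conceptual illustration of the join semantics, not Spark itself):

```python
# The two sample datasets from the question, as lists of dicts.
dataset1 = [
    {"empid": 101, "empname": "john"},
    {"empid": 102, "empname": "kevin"},
]
dataset2 = [
    {"empid": 101, "empmarks": 75, "empaddress": "LA"},
    {"empid": 102, "empmarks": 69, "empaddress": "NY"},
]

# Inner join on the common key: keep only empids present on both sides.
lookup = {row["empid"]: row for row in dataset1}
joined = [
    {**lookup[row["empid"]], **row}
    for row in dataset2
    if row["empid"] in lookup
]
# Each joined row now carries empid, empname, empmarks, and empaddress.
```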
You can also use the broadcast function/hint like this, which tells the framework that the small DataFrame, i.e. dataset1, should be broadcast to every executor:
import org.apache.spark.sql.functions.broadcast
dataset2.join(broadcast(dataset1), Seq("empid"), "inner").show()
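Conceptually, a broadcast hash join ships the small side to every executor as an in-memory hash map, then streams the large side through it, so no shuffle of the large dataset is needed. A minimal single-process sketch of that mechanic (the function name is illustrative, not a Spark API):

```python
def broadcast_hash_join(small_rows, large_rows, key):
    """Build a hash map from the small side once, then probe it
    for each row of the large side (inner-join semantics)."""
    # This map is what Spark would broadcast to every executor.
    hash_map = {row[key]: row for row in small_rows}
    for row in large_rows:
        match = hash_map.get(row[key])
        if match is not None:
            # Merge the matching rows into one output record.
            yield {**match, **row}

dataset1 = [{"empid": 101, "empname": "john"},
            {"empid": 102, "empname": "kevin"}]
dataset2 = [{"empid": 101, "empmarks": 75, "empaddress": "LA"},
            {"empid": 103, "empmarks": 50, "empaddress": "SF"}]

# empid 103 has no match in dataset1, so only empid 101 survives the join.
result = list(broadcast_hash_join(dataset1, dataset2, "empid"))
```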
Also check these for more details:

DataFrame join optimization - Broadcast Hash Join (how broadcast joins work)
What-is-the-maximum-size-for-a-broadcast-object-in-spark