Processing Hive Lookup tables in Spark vs Spark Broadcast variables


Problem Description

I have two datasets named dataset1 and dataset2, and dataset1 looks like this:

empid  empname
101    john
102    kevin

dataset2 looks like this:

empid  empmarks  empaddress
101      75        LA
102      69        NY

dataset2 will be very large. I need to perform some operations on these two datasets and get results from both of them. As far as I know, I now have two options for processing these datasets:

1. Store dataset1 (which is smaller) as a Hive lookup table and process the join against it through Spark.

2. Process these datasets using Spark broadcast variables (a rough sketch follows below).
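For concreteness, option 2 would look roughly like the following minimal sketch. It assumes dataset1 is a DataFrame with an integer empid column and a string empname column, and that a SparkSession named spark is in scope; the helper names are illustrative, not from the question.

import org.apache.spark.sql.functions.{col, udf}

// Collect the small lookup dataset to the driver and broadcast it as a plain map.
val empNames: Map[Int, String] = dataset1.collect()
  .map(r => r.getInt(0) -> r.getString(1)).toMap
val empNamesBc = spark.sparkContext.broadcast(empNames)

// Look names up from the broadcast map inside a UDF; unknown ids become null.
val lookupName = udf((empid: Int) => empNamesBc.value.get(empid).orNull)
val result = dataset2.withColumn("empname", lookupName(col("empid")))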

Could anyone please suggest which is the better option?

Recommended Answer

The following should be a better option than the two mentioned above.

Since you have a common key, you can do an inner join:

dataset2.join(dataset1, Seq("empid"), "inner").show()
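Note that even without an explicit hint, Spark will broadcast the smaller side of a join automatically when its estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default). A sketch of tuning that threshold, again assuming a SparkSession named spark:

// Raise the automatic broadcast threshold to 50 MB (an illustrative value);
// setting it to -1 disables automatic broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)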

You can also use the broadcast function/hint like this, which tells the framework that the small DataFrame, i.e. dataset1, should be broadcast to every executor:

import org.apache.spark.sql.functions.broadcast
dataset2.join(broadcast(dataset1), Seq("empid"), "inner").show()
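Put together, here is a minimal self-contained sketch; the SparkSession setup and the sample rows recreate the question's data and are otherwise illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("lookup-join").master("local[*]").getOrCreate()
import spark.implicits._

val dataset1 = Seq((101, "john"), (102, "kevin")).toDF("empid", "empname")
val dataset2 = Seq((101, 75, "LA"), (102, 69, "NY"))
  .toDF("empid", "empmarks", "empaddress")

// Broadcasting the small side turns this into a map-side hash join,
// so the large dataset2 is never shuffled across the cluster.
dataset2.join(broadcast(dataset1), Seq("empid"), "inner").show()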

Also see the following for more details:

  • DataFrame join optimization - Broadcast Hash Join: how broadcast joins work.

  • What-is-the-maximum-size-for-a-broadcast-object-in-spark
