Processing Hive Lookup tables in Spark vs Spark Broadcast variables


Problem description

I have two datasets named dataset1 and dataset2, and dataset1 looks like:

empid  empname
101    john
102    kevin

and dataset2 looks like:

empid  empmarks  empaddress
101      75        LA
102      69        NY
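
For concreteness, here is a minimal sketch of how these two example DataFrames could be built (the SparkSession setup and construction below are illustrative assumptions, not part of the original question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lookup-join").master("local[*]").getOrCreate()
import spark.implicits._

// Small lookup dataset: employee ids and names
val dataset1 = Seq((101, "john"), (102, "kevin")).toDF("empid", "empname")

// Large dataset: marks and addresses keyed by the same employee id
val dataset2 = Seq((101, 75, "LA"), (102, 69, "NY")).toDF("empid", "empmarks", "empaddress")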

dataset2 will be very large, and I need to perform some operations on these two datasets and get results from both of them. As far as I know, I now have two options to process these datasets:

1. Store dataset1 (which is smaller in size) as a Hive lookup table and process it through Spark.

2. Process these datasets by using Spark Broadcast Variables (a sketch of this approach follows below).
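
For option 2, here is a hedged sketch of what the raw broadcast-variable approach might look like, reusing the SparkSession and DataFrames from the sketch above (the UDF-based lookup is one illustrative way to apply the broadcast map, not the only one):

import org.apache.spark.sql.functions.{col, udf}

// Collect the small dataset to the driver and broadcast it as a plain Map
val lookup = dataset1.collect().map(r => r.getInt(0) -> r.getString(1)).toMap
val lookupBc = spark.sparkContext.broadcast(lookup)

// Each executor reads the broadcast Map locally instead of shuffling dataset1
val nameFor = udf((id: Int) => lookupBc.value.getOrElse(id, "unknown"))
val enriched = dataset2.withColumn("empname", nameFor(col("empid")))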

Can anyone suggest which one is the better option?

Recommended answer

This should be a better option than the two options mentioned above.

Since you have a common key, you can do an inner join:

dataset2.join(dataset1, Seq("empid"), "inner").show()
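
On the sample data above, this would print something like:

+-----+--------+----------+-------+
|empid|empmarks|empaddress|empname|
+-----+--------+----------+-------+
|  101|      75|        LA|   john|
|  102|      69|        NY|  kevin|
+-----+--------+----------+-------+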

You can also use the broadcast function/hint like this, which tells the framework that the small DataFrame (i.e. dataset1) should be broadcast to every executor:

import org.apache.spark.sql.functions.broadcast
dataset2.join(broadcast(dataset1), Seq("empid"), "inner").show()
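
To confirm the hint takes effect, you can inspect the physical plan, which should show a BroadcastHashJoin. Note that Spark also broadcasts tables automatically when their estimated size falls below spark.sql.autoBroadcastJoinThreshold (10 MB by default). A small hedged sketch:

// The physical plan should contain BroadcastHashJoin rather than SortMergeJoin
dataset2.join(broadcast(dataset1), Seq("empid"), "inner").explain()

// Tables below this size are broadcast automatically, hint or not (default 10 MB)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760L)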

Also look at the following for more details:

  • DataFrame join optimization - Broadcast Hash Join: how broadcast joins work.

  • What is the maximum size for a broadcast object in Spark?

