How to access broadcasted DataFrame in Spark


Problem Description


I have created two DataFrames from Hive tables (PC_ITM and ITEM_SELL). Both are large, and I use them frequently in SQL queries by registering them as tables. Because they are big, the queries take a long time, so I saved them as Parquet files, read them back, and registered them as temporary tables. I still wasn't getting good performance, so I broadcast the DataFrames and then registered them as tables, as below.

val PC_ITM_DF=sqlContext.parquetFile("path")
val PC_ITM_BC=sc.broadcast(PC_ITM_DF)
val PC_ITM_DF1=PC_ITM_BC.value
PC_ITM_DF1.registerAsTempTable("PC_ITM")

val ITM_SELL_DF=sqlContext.parquetFile("path")
val ITM_SELL_BC=sc.broadcast(ITM_SELL_DF)
val ITM_SELL_DF1=ITM_SELL_BC.value
ITM_SELL_DF1.registerAsTempTable("ITM_SELL")


sqlContext.sql("JOIN Query").show


But I still can't get better performance; the queries take the same time as when the DataFrames are not broadcast.


Can anyone tell me if this is the right approach to broadcasting and using it?

Recommended Answer


You don't really need to 'access' the broadcast DataFrame - you just use it, and Spark will implement the broadcast under the hood. The broadcast function works nicely, and makes more sense than the sc.broadcast approach.
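To illustrate, here is a minimal sketch of that preferred approach: rather than wrapping the DataFrame in sc.broadcast (which serializes the driver-side object but does not influence join planning), you pass it through the broadcast() hint at the point of the join. The join key item_id is a hypothetical column name for illustration, not taken from the question:

```scala
import org.apache.spark.sql.functions.broadcast

val PC_ITM_DF = sqlContext.parquetFile("path")
val ITM_SELL_DF = sqlContext.parquetFile("path")

// broadcast() returns a new DataFrame carrying a broadcast hint;
// Spark's planner will then choose a broadcast hash join for it.
// "item_id" is an assumed join key, not from the original question.
val joined = ITM_SELL_DF.join(broadcast(PC_ITM_DF), "item_id")
joined.show()
```

The key difference is that the hint lives on the DataFrame returned by broadcast(), so it must be the one used in the join.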


It can be hard to understand where the time is being spent if you evaluate everything at once.


You can break your code into steps. The key here will be performing an action and persisting the dataframes you want to broadcast before you use them in your join.

// load your dataframe
import org.apache.spark.sql.functions.broadcast

val PC_ITM_DF = sqlContext.parquetFile("path")

// mark this dataframe to be stored in memory once evaluated
PC_ITM_DF.persist()

// mark this dataframe to be broadcast; note that broadcast() returns
// a new DataFrame carrying the hint, so keep the result
val PC_ITM_BC = broadcast(PC_ITM_DF)

// perform an action to force the evaluation
PC_ITM_DF.count()

Doing this will ensure that the dataframe is:

  • loaded into memory (persisted)
  • registered as a temporary table for use in your SQL query
  • marked as broadcast, so it will be sent to all executors


When you now run sqlContext.sql("JOIN Query").show, you should see a 'broadcast hash join' in the SQL tab of your Spark UI.
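If you prefer to check from code rather than the UI, a quick sketch is to print the physical plan with explain() and look for a BroadcastHashJoin node. "JOIN Query" below stands in for the question's actual SQL, which was not shown:

```scala
// "JOIN Query" is the question's elided SQL; substitute your real query.
val result = sqlContext.sql("JOIN Query")

// explain(true) prints the logical and physical plans to stdout;
// a successful broadcast shows up as a BroadcastHashJoin operator.
result.explain(true)
result.show()
```

This is often faster than digging through the UI when iterating on query tuning.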

