Broadcast hash join - Iterative


Question

We use a broadcast hash join in Spark when one dataframe is small enough to fit into memory, that is, when the size of the small dataframe is below spark.sql.autoBroadcastJoinThreshold. I have a few questions around this.

What is the life cycle of the small dataframe that we hint as broadcast? For how long will it remain in memory? How can we control it?

For example, suppose I join a big dataframe with a small dataframe twice using a broadcast hash join. When the first join runs, it broadcasts the small dataframe to the worker nodes and performs the join while avoiding a shuffle of the big dataframe's data.
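The scenario above can be sketched as follows. This is an illustrative example, not code from the question: the names `bigDF` and `smallDF` and the toy data are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-demo").getOrCreate()
import spark.implicits._

// Illustrative data: bigDF is large, smallDF fits comfortably in memory.
val bigDF   = spark.range(0, 1000000).withColumnRenamed("id", "key")
val smallDF = Seq((0L, "a"), (1L, "b")).toDF("key", "label")

// First join: the broadcast() hint ships smallDF to every executor,
// so bigDF is joined in place with no shuffle of its partitions.
val joined1 = bigDF.join(broadcast(smallDF), "key")

// Second join against the same small dataframe: Spark plans another
// BroadcastHashJoin, but the DataFrame API gives you no handle on the
// broadcast data it created, so you cannot free or reuse it explicitly.
val joined2 = joined1.join(broadcast(smallDF), Seq("key"), "left")
```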

My question is: for how long will the executor keep a copy of the broadcast dataframe? Will it remain in memory until the session ends? Or will it be cleared once an action has run? Can we control it or clear it? Or am I just thinking in the wrong direction?

Answer

The answer to your question, at least in Spark 2.4.0, is that the dataframe will remain in memory on the driver process until the SparkContext is completed, that is, until your application ends.

Broadcast joins are in fact implemented using broadcast variables, but when using the DataFrame API you do not get access to the underlying broadcast variable. Spark itself does not destroy this variable after using it internally, so it just stays around.

Specifically, if you look at the code of BroadcastExchangeExec (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala), you can see that it creates a private variable relationFuture which holds the Broadcast variable. This private variable is only used in this class. There is no way for you as a user to get access to it to call destroy on it, and nowhere in the current implementation does Spark call it for you.
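If you do need explicit control over a broadcast's lifetime, one workaround (not mentioned in the original answer, and sketched here under assumed names like `lookup`) is to create the broadcast variable yourself with the RDD API, where the `Broadcast` handle exposes `unpersist` and `destroy`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-lifecycle-demo").getOrCreate()
val sc = spark.sparkContext

// Broadcast a small lookup table manually; you keep the handle.
val lookup = sc.broadcast(Map(0L -> "a", 1L -> "b"))

// Use it inside RDD operations on the big data set.
val tagged = sc.range(0, 1000000)
  .map(k => (k, lookup.value.getOrElse(k, "unknown")))

tagged.count()  // force evaluation while the broadcast is alive

// Unlike the DataFrame-internal broadcast, this one you can release:
lookup.unpersist()  // drop executor copies; it can be rebroadcast lazily
lookup.destroy()    // free it on driver and executors; must not be used again
```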

