Broadcast hash join - Iterative

Question

We use a broadcast hash join in Spark when one dataframe is small enough to fit into memory, that is, when the size of the small dataframe is below spark.sql.autoBroadcastJoinThreshold. I have a few questions about this.

What is the life cycle of the small dataframe that we hint as broadcast? How long will it remain in memory? How can we control it?

For example, suppose I join a big dataframe with a small dataframe twice using a broadcast hash join. When the first join runs, it broadcasts the small dataframe to the worker nodes and performs the join without shuffling the big dataframe's data.

My question is: how long will an executor keep its copy of the broadcast dataframe? Will it remain in memory until the session ends, or will it be cleared once we have run an action? Can we control or clear it? Or am I just thinking in the wrong direction?

Recommended answer

The answer to your question, at least in Spark 2.4.0, is that the dataframe will remain in memory on the driver process until the SparkContext is stopped, that is, until your application ends.

Broadcast joins are in fact implemented using broadcast variables, but when using the DataFrame API you do not get access to the underlying broadcast variable. Spark itself does not destroy this variable after it uses it internally, so it just stays around.
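To make the mechanism concrete, here is a minimal pure-Python sketch of what a broadcast hash join does conceptually. It is not Spark's implementation: the function name `broadcast_hash_join` and the list-of-lists stand-in for partitions are assumptions made for illustration. The small side is hashed once (the part that would be broadcast), and each partition of the big side is probed in place, so no rows of the big side move between partitions.

```python
from collections import defaultdict


def broadcast_hash_join(big_partitions, small_rows):
    """Inner-join partitions of (key, value) rows against a small list of (key, value) rows.

    Hypothetical illustration only: big_partitions is a list of lists standing in
    for the partitions of the big dataframe; small_rows stands in for the small one.
    """
    # Build phase: hash the small side once. In Spark, this hashed relation
    # is what gets shipped to every executor as a broadcast variable.
    table = defaultdict(list)
    for key, small_value in small_rows:
        table[key].append(small_value)

    joined_partitions = []
    for partition in big_partitions:
        # Probe phase: each partition is joined where it already lives,
        # so the big side is never shuffled.
        joined = [
            (key, big_value, small_value)
            for key, big_value in partition
            for small_value in table.get(key, [])
        ]
        joined_partitions.append(joined)
    return joined_partitions
```

The output keeps the same number of partitions as the input, which is the property that makes this join cheap for the big side. The cost is that the whole hash table has to fit in memory wherever it is used.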

Specifically, if you look at the code of BroadcastExchangeExec (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala), you can see that it creates a private variable relationFuture which holds the Broadcast variable. This private variable is only used in this class. There is no way for you as a user to get access to it to call destroy on it, and nowhere in the current implementation does Spark call it for you.
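If you do need explicit control over the lifetime, one possible workaround (my suggestion, not something the DataFrame join does for you) is to drop to the RDD level and manage the broadcast variable yourself. The sketch below assumes `sc` is a PySpark SparkContext, `big_rdd` is an RDD of (key, value) tuples, and the keys on the small side are unique. The function name `lookup_join_with_explicit_broadcast` is hypothetical. `SparkContext.broadcast` and `Broadcast.destroy` are the PySpark calls it relies on.

```python
def lookup_join_with_explicit_broadcast(sc, big_rdd, small_pairs):
    """Inner-join an RDD of (key, value) tuples against a small list of (key, value) pairs.

    Assumptions: sc is a pyspark SparkContext, big_rdd is an RDD of 2-tuples,
    and keys in small_pairs are unique (duplicates would be collapsed by dict()).
    """
    # Ship the small side to the executors once, as a plain dict.
    small_bc = sc.broadcast(dict(small_pairs))

    def probe(record):
        key, big_value = record
        table = small_bc.value
        # Inner-join semantics: drop rows with no match on the small side.
        if key in table:
            return [(key, big_value, table[key])]
        return []

    try:
        # collect() is an action, so the join has finished before cleanup runs.
        return big_rdd.flatMap(probe).collect()
    finally:
        # Release the broadcast data; the variable cannot be used after this.
        small_bc.destroy()
```

Because you hold the reference to the broadcast variable here, you decide when it goes away. If you plan to reuse it for a second join, call destroy only after the last join has run, or use unpersist instead, which drops the executor copies but lets Spark re-send the data if the variable is used again.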
