Broadcast hash join - Iterative


Question

We use a broadcast hash join in Spark when one dataframe is small enough to fit into memory, that is, when the size of the small dataframe is below spark.sql.autoBroadcastJoinThreshold. I have a few questions around this.

What is the life cycle of the small dataframe that we hint as broadcast? For how long will it remain in memory? How can we control it?

For example, suppose I join a big dataframe with a small dataframe twice using a broadcast hash join. When the first join runs, it broadcasts the small dataframe to the worker nodes and performs the join while avoiding a shuffle of the big dataframe's data.
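The scenario above can be sketched as follows. This is an illustrative example, not code from the question: the names `bigDF` and `smallDF` and the toy data are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-demo").getOrCreate()
import spark.implicits._

// Illustrative data: bigDF is large, smallDF fits comfortably in memory.
val bigDF   = spark.range(0, 1000000).withColumnRenamed("id", "key")
val smallDF = Seq((0L, "a"), (1L, "b")).toDF("key", "label")

// First join: the broadcast() hint ships smallDF to every executor,
// so bigDF is joined in place with no shuffle of its partitions.
val joined1 = bigDF.join(broadcast(smallDF), "key")

// Second join against the same small dataframe: Spark plans another
// BroadcastHashJoin, but the DataFrame API gives you no handle on the
// broadcast data it created, so you cannot free or reuse it explicitly.
val joined2 = joined1.join(broadcast(smallDF), Seq("key"), "left")
```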

My question is: for how long will the executor keep a copy of the broadcast dataframe? Will it remain in memory until the session ends? Or will it be cleared once an action has run? Can we control it or clear it? Or am I just thinking in the wrong direction?

Answer

The answer to your question, at least in Spark 2.4.0, is that the dataframe will remain in memory on the driver process until the SparkContext is completed, that is, until your application ends.

Broadcast joins are in fact implemented using broadcast variables, but when using the DataFrame API you do not get access to the underlying broadcast variable. Spark itself does not destroy this variable after using it internally, so it just stays around.

Specifically, if you look at the code of BroadcastExchangeExec (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala), you can see that it creates a private variable relationFuture which holds the Broadcast variable. This private variable is only used in this class. There is no way for you as a user to get access to it to call destroy on it, and nowhere in the current implementation does Spark call it for you.
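If you do need explicit control over a broadcast's lifetime, one workaround (not mentioned in the original answer, and sketched here under assumed names like `lookup`) is to create the broadcast variable yourself with the RDD API, where the `Broadcast` handle exposes `unpersist` and `destroy`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-lifecycle-demo").getOrCreate()
val sc = spark.sparkContext

// Broadcast a small lookup table manually; you keep the handle.
val lookup = sc.broadcast(Map(0L -> "a", 1L -> "b"))

// Use it inside RDD operations on the big data set.
val tagged = sc.range(0, 1000000)
  .map(k => (k, lookup.value.getOrElse(k, "unknown")))

tagged.count()  // force evaluation while the broadcast is alive

// Unlike the DataFrame-internal broadcast, this one you can release:
lookup.unpersist()  // drop executor copies; it can be rebroadcast lazily
lookup.destroy()    // free it on driver and executors; must not be used again
```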

