当不满足所有选择标准时，Spark 会选择哪个连接? [英] Which join will Spark choose when all the selection criteria are not met?

查看：18 发布时间：2021/11/14 22:50:14 apache-spark join apache-spark-sql

本文介绍了当不满足所有选择标准时，Spark 会选择哪个连接?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我们知道在Spark中有3种join——Broadcast Join、Shuffle Join和Sort-Merge Join:

We know that in Spark have three types of joins -- Broadcast Join， Shuffle Join and Sort-Merge Join：

当小表加入大表时，使用广播加入；
当小表大于BroadcastJoinThreshold时，使用Shuffle Join；
当大表join，并且join key可以排序时，使用Sort-Merge Join；

如果有两个大表的join，join key无法排序怎么办?Spark 会选择哪种联接类型?

What happens in a case where there is a join of two big tables and the join key can't be sorted? Which join type Spark will choose?

推荐答案

Spark 3.0 及更高版本支持以下类型的连接:

Spark 3.0 and above supports these types of joins:

广播哈希联接 (BHJ)
随机哈希连接
随机排序合并连接 (SMJ)
广播嵌套循环连接 (BNLJ)
笛卡尔积连接

他们的选择最好在 SparkStrategies.scala:

Their selection is best outlined in the source code for SparkStrategies.scala:

  /**
   * Select the proper physical plan for join based on join strategy hints, the availability of
   * equi-join keys and the sizes of joining relations. Below are the existing join strategies,
   * their characteristics and their limitations.
   *
   * - Broadcast hash join (BHJ):
   *     Only supported for equi-joins, while the join keys do not need to be sortable.
   *     Supported for all join types except full outer joins.
   *     BHJ usually performs faster than the other join algorithms when the broadcast side is
   *     small. However, broadcasting tables is a network-intensive operation and it could cause
   *     OOM or perform badly in some cases, especially when the build/broadcast side is big.
   *
   * - Shuffle hash join:
   *     Only supported for equi-joins, while the join keys do not need to be sortable.
   *     Supported for all join types except full outer joins.
   *
   * - Shuffle sort merge join (SMJ):
   *     Only supported for equi-joins and the join keys have to be sortable.
   *     Supported for all join types.
   *
   * - Broadcast nested loop join (BNLJ):
   *     Supports both equi-joins and non-equi-joins.
   *     Supports all the join types, but the implementation is optimized for:
   *       1) broadcasting the left side in a right outer join;
   *       2) broadcasting the right side in a left outer, left semi, left anti or existence join;
   *       3) broadcasting either side in an inner-like join.
   *     For other cases, we need to scan the data multiple times, which can be rather slow.
   *
   * - Shuffle-and-replicate nested loop join (a.k.a. cartesian product join):
   *     Supports both equi-joins and non-equi-joins.
   *     Supports only inner like joins.
   */
object JoinSelection extends Strategy with PredicateHelper { ...

如前所述，应用选择的结果不仅取决于表的大小和键的可排序性，还取决于连接类型(INNER、LEFT/RIGHT, FULL) 并加入关键条件(equi- vs non-equi/theta).总的来说，在您的情况下，您可能会看到 Shuffle Hash 或 Broadcast Nested Loop.

As stated, the outcome of applying the selection depends not only on the size of the tables and sortability of the keys, but also on a join type (INNER, LEFT/RIGHT, FULL) and join key conditions (equi- vs non-equi/theta). Overall, seems like in your situation you'll be looking at either Shuffle Hash or Broadcast Nested Loop.

这篇关于当不满足所有选择标准时，Spark 会选择哪个连接?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

当不满足所有选择标准时，Spark 会选择哪个连接? [英] Which join will Spark choose when all the selection criteria are not met?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

当不满足所有选择标准时，Spark 会选择哪个连接? [英] Which join will Spark choose when all the selection criteria are not met?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭