Broadcast join in Spark not working for left outer join


Question

I have a small table (2k records) and a big table (5 million records). I need to fetch all rows from the small table and only the matching rows from the big table, so I ran the query `select /*+ broadcast(small) */ small.* from small left outer join large`. The query returns the correct result, but the query plan shows a sort merge join rather than a broadcast hash join. Is there a limitation that prevents the small table from being broadcast when it is the left table, and if so, what is the way around it?

Recommended answer

Because you want the complete dataset from the small table rather than from the big table, Spark does not use a broadcast join here. If you change the join order, or convert the query to an inner join (where either side can be broadcast), Spark will happily use a broadcast join.

For example:

  1. Big table left outer join small table -- broadcast enabled
  2. Small table left outer join big table -- broadcast disabled

Reason: *Spark shares the small table (the broadcast table) with every data node that holds part of the big table. In your case, all rows of the small table are needed, but only the matching rows of the big table. Each node sees only its own slice of the big table, so when a small-table row finds no match locally, Spark cannot tell whether that row was matched on another data node or has no match anywhere. Because of this ambiguity, it cannot correctly emit all rows of the small table when the join work is distributed this way, so Spark does not use a broadcast join in this case.*
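The ambiguity described above can be illustrated with a small pure-Python simulation. This is only a toy sketch, not Spark's implementation: the tables, the two "partitions", and all function names are hypothetical and chosen for illustration. It shows that joining each partition of the large table independently against a broadcast copy of the small table gives wrong rows when the small table is the preserved (left) side, but gives the right rows when the large table is the preserved side.

```python
from collections import defaultdict

# Hypothetical toy data: (key, value) rows.
small = [(1, "s1"), (2, "s2"), (3, "s3")]

# The large table is split into partitions, as it would be across executors.
large_partitions = [
    [(1, "L-a"), (4, "L-d")],  # partition 0
    [(2, "L-b"), (1, "L-c")],  # partition 1
]
all_large = [row for part in large_partitions for row in part]


def left_outer(left, right):
    """Single-node left outer join on the first field of each row."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    out = []
    for key, value in left:
        matches = index.get(key)
        if matches:
            out.extend((key, value, m) for m in matches)
        else:
            # No match: keep the left row, pad the right side with None.
            out.append((key, value, None))
    return out


def expected_small_left_join_large():
    """Reference answer: join against the whole large table at once."""
    return sorted(left_outer(small, all_large), key=repr)


def naive_broadcast_small_as_left():
    """small LEFT OUTER JOIN large, done per partition with a copy of small.

    Each partition pads "unmatched" small rows on its own, without knowing
    whether another partition matched them, so the result is wrong.
    """
    out = []
    for part in large_partitions:
        out.extend(left_outer(small, part))
    return sorted(out, key=repr)


def broadcast_small_as_right():
    """large LEFT OUTER JOIN small, done per partition with a copy of small.

    Every large row lives in exactly one partition, so each partition can
    decide locally whether to pad it.
    """
    out = []
    for part in large_partitions:
        out.extend(left_outer(part, small))
    return sorted(out, key=repr)


if __name__ == "__main__":
    print("expected :", expected_small_left_join_large())
    print("naive    :", naive_broadcast_small_as_left())
    print("reversed :", broadcast_small_as_right())
```

In the naive version, the small row with key 2 is padded with `None` by partition 0 even though partition 1 matches it, and the row with key 3 is padded once per partition. Avoiding that would need coordination between partitions, which a plain broadcast join does not do.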
