How to hint for sort merge join or shuffled hash join (and skip broadcast hash join)?

Problem description

I have an issue with a join in Spark 2.1. Spark (wrongly?) chooses a broadcast hash join although the table is very large (14 million rows). The job then crashes because there is not enough memory, and Spark somehow tries to persist the broadcast pieces to disk, which then leads to a timeout.

I know there is a query hint to force a broadcast join (org.apache.spark.sql.functions.broadcast), but is there also a way to force another join algorithm?

I solved my issue by setting spark.sql.autoBroadcastJoinThreshold=0, but I would prefer a more granular solution, i.e. one that does not disable the broadcast join globally.

Recommended answer

If a broadcast hash join can be used (because of the broadcast hint or because of the total size of a relation), Spark SQL chooses it over other joins (see the JoinSelection execution planning strategy).

With that said, either do not force a broadcast hash join (by applying the broadcast standard function to the left or right side of the join), or disable the preference for a broadcast hash join by setting spark.sql.autoBroadcastJoinThreshold to 0 or a negative value.
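One way to keep that setting from applying to the whole application is to change it only around the join in question and restore it afterwards. The following is a minimal sketch of that idea, not a built-in per-join hint. The object name and the two small DataFrames (orders, customers) are hypothetical stand-ins for the real tables. It also assumes that physical planning happens lazily, when explain() or an action runs, so the join has to be planned while the lowered threshold is still in effect.

```scala
import org.apache.spark.sql.SparkSession

object ScopedBroadcastThreshold {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("scoped-broadcast-threshold")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the real (large) tables.
    val orders = Seq((1, "a"), (2, "b")).toDF("id", "order_ref")
    val customers = Seq((1, "x"), (2, "y")).toDF("id", "name")

    val key = "spark.sql.autoBroadcastJoinThreshold"
    // Remember the current value (if one is available) so it can be restored.
    val previous = spark.conf.getOption(key)

    // A negative threshold turns off size-based selection of broadcast hash join.
    spark.conf.set(key, "-1")
    try {
      val joined = orders.join(customers, Seq("id"))
      // Plan and run the join while the setting is active (assumption:
      // the physical plan is built lazily at this point, not at join()).
      joined.explain()
      joined.count()
    } finally {
      // Put the session back the way it was for all later queries.
      previous match {
        case Some(value) => spark.conf.set(key, value)
        case None        => spark.conf.unset(key)
      }
    }

    spark.stop()
  }
}
```

With the threshold disabled and no broadcast hint on either side, the printed plan should no longer contain a BroadcastHashJoin node. Joins planned after the finally block use the original threshold again.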
