How to hint for sort merge join or shuffled hash join (and skip broadcast hash join)?


Question

I have an issue with a join in Spark 2.1. Spark (wrongly?) chooses a broadcast hash join although the table is very large (14 million rows). The job then crashes because there is not enough memory: Spark somehow tries to persist the broadcast pieces to disk, which then leads to a timeout.

I know there is a query hint to force a broadcast join (org.apache.spark.sql.functions.broadcast), but is there also a way to force another join algorithm?

I solved my issue by setting spark.sql.autoBroadcastJoinThreshold=0, but I would prefer a more granular solution, i.e. one that does not disable the broadcast join globally.

Recommended answer

If a broadcast hash join can be used (because of the broadcast hint or the total size of a relation), Spark SQL chooses it over other joins (see the JoinSelection execution planning strategy).
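The preference described above can be illustrated with a small, plain-Python model. This is only a sketch of the decision order, not Spark's JoinSelection code: the `Relation` class and the `choose_join` function are hypothetical names invented for illustration, and the model ignores shuffled hash join, join types, and key ordering, all of which the real strategy also considers.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    """Hypothetical stand-in for one side of a join; not a Spark class."""
    size_in_bytes: int            # estimated size of the relation
    broadcast_hint: bool = False  # True if wrapped in broadcast(...)


def can_broadcast(rel: Relation, threshold: int) -> bool:
    # Assumed rule: a side qualifies if it carries the broadcast hint
    # or its estimated size does not exceed the threshold.
    return rel.broadcast_hint or 0 <= rel.size_in_bytes <= threshold


def choose_join(left: Relation, right: Relation,
                threshold: int = 10 * 1024 * 1024) -> str:
    # Broadcast hash join wins whenever either side qualifies;
    # otherwise this simplified model falls back to sort merge join.
    if can_broadcast(right, threshold) or can_broadcast(left, threshold):
        return "broadcast hash join"
    return "sort merge join"
```

In this model, a negative threshold means no size can qualify, so only an explicit hint can still trigger a broadcast hash join.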

With that said, don't force a broadcast hash join (by applying the broadcast standard function to the left or right side of the join), or disable the preference for a broadcast hash join by setting spark.sql.autoBroadcastJoinThreshold to 0 or a negative value.
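To keep the setting from applying to every query in the session, one possible approach is to change it only around the join in question and restore it afterwards. The helper below is a minimal sketch under that assumption: `broadcast_join_disabled` is a hypothetical name, not part of the Spark API, and it relies only on the session's `conf.get` and `conf.set` methods.

```python
from contextlib import contextmanager

KEY = "spark.sql.autoBroadcastJoinThreshold"


@contextmanager
def broadcast_join_disabled(spark):
    """Temporarily set the broadcast threshold to -1 on `spark`.

    `spark` is expected to be a SparkSession; any object exposing
    conf.get(key) and conf.set(key, value) works the same way.
    """
    previous = spark.conf.get(KEY)   # remember the current value
    spark.conf.set(KEY, "-1")        # negative value: no size-based broadcast
    try:
        yield spark
    finally:
        spark.conf.set(KEY, previous)  # restore even if the body raises


# Usage with hypothetical DataFrames `big` and `other`:
# with broadcast_join_disabled(spark):
#     big.join(other, "id").count()
```

Because Spark plans a query lazily, the action that executes the join (here `count()`) probably needs to run inside the `with` block for the changed setting to take effect. The setting is still session-wide while the block is active, so other queries planned in the same session during that time are affected as well.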
