Co-partitioned joins in Spark SQL

Problem description

Are there any implementations of Spark SQL DataSources that offer co-partitioned joins - most likely via CoGroupRDD? I did not see any uses within the existing Spark codebase.

The motivation would be to greatly reduce shuffle traffic in the case where two tables have the same number of partitions and the same ranges of partitioning keys: in that case there would be an Mx1 instead of an MxN shuffle fanout.

The only large-scale join implementation presently in Spark SQL seems to be ShuffledHashJoin, which does require the MxN shuffle fanout and is therefore expensive.
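For illustration, here is a small sketch (hypothetical data, Spark 2.x SparkSession API) of that default path; explain() shows the shuffle inserted on each side of the join:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-join-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Two un-bucketed DataFrames sharing a join column "key".
val left  = spark.range(1000L).withColumn("key", $"id" % 10)
val right = spark.range(1000L).withColumn("key", $"id" % 10)

// The physical plan contains Exchange hashpartitioning(key, ...) on both
// sides - the MxN shuffle fanout described above. (The exact join operator,
// ShuffledHashJoin or SortMergeJoin, depends on the Spark version and config.)
left.join(right, "key").explain()
```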

Recommended answer

I think you are looking for the Bucket Join optimization that should be coming in Spark 2.0.
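As a rough sketch of how that bucketing is expected to work (reusing spark, left, and right from the sketch above; table names and the bucket count of 8 are illustrative), writing both sides bucketed by the join key into the same number of buckets lets the planner drop the shuffle on both sides:

```scala
// bucketBy requires saveAsTable, so the bucketing metadata lands in the
// session catalog / metastore.
left.write.bucketBy(8, "key").sortBy("key").saveAsTable("left_bucketed")
right.write.bucketBy(8, "key").sortBy("key").saveAsTable("right_bucketed")

// Reading the bucketed tables back, the partitioning of the join keys is
// already known to the planner, so no re-shuffle should be needed when the
// bucket counts on both sides match.
val bucketedJoin = spark.table("left_bucketed")
  .join(spark.table("right_bucketed"), "key")
bucketedJoin.explain()  // the plan should show no Exchange above the scans
```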

In 1.6 you can accomplish something similar, but only by caching the data; see SPARK-4849.
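A hedged sketch of that 1.6-era workaround, assuming two DataFrames leftDF and rightDF with a common join column "key" (all names here are hypothetical): repartition both sides by the join key and cache them, so that - per the JIRA referenced above - the in-memory relations retain the hash partitioning and the join can reuse it.

```scala
import org.apache.spark.sql.functions.col

// Same key and same partition count (8) on both sides.
val leftCached  = leftDF.repartition(8, col("key")).cache()
val rightCached = rightDF.repartition(8, col("key")).cache()
leftCached.count()   // materialize the caches
rightCached.count()

// With identical partitioning on both cached sides, the planner can skip
// re-shuffling this join.
val cachedJoin = leftCached.join(rightCached, "key")
cachedJoin.explain()
```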
