Spark join *without* shuffle


Question

I am trying to optimise my Spark application job.

I tried to understand the points from this question: How to avoid shuffles while joining DataFrames on unique keys?

  1. I have made sure that the keys on which the join operation has to happen are distributed within the same partition (using my custom partitioner).

I also cannot do a broadcast join because my data may become large depending on the situation.

In the answer to the above-mentioned question, repartitioning only optimises the join, but what I need is a join WITHOUT A SHUFFLE. I am fine with the join operation being done with the help of the keys within each partition.

Is it possible? I want to implement something like a joinperpartition if similar functionality does not exist.
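To make the intent concrete, here is a minimal plain-Python sketch (not a Spark API) of what a hypothetical `join_per_partition` would do: both inputs are assumed to already be split into index-aligned partitions by the same partitioner, and the join only ever compares keys within the same partition, never across partitions.

```python
# Hypothetical join_per_partition: joins two pre-partitioned datasets
# partition-by-partition, never comparing keys across partitions.
# This is only correct if both sides were split by the SAME partitioner.

def join_per_partition(parts_a, parts_b):
    """parts_a/parts_b: index-aligned lists of partitions,
    each partition a list of (key, value) pairs."""
    result = []
    for pa, pb in zip(parts_a, parts_b):
        # Build a local lookup table for this partition of side B.
        lookup = {}
        for k, v in pb:
            lookup.setdefault(k, []).append(v)
        # Probe it with side A of the same partition only.
        for k, v in pa:
            for w in lookup.get(k, []):
                result.append((k, (v, w)))
    return result

# Both sides partitioned by key % 2 (the shared "partitioner"):
a = [[(2, "a2"), (4, "a4")], [(1, "a1"), (3, "a3")]]
b = [[(2, "b2")], [(1, "b1"), (3, "b3")]]
print(join_per_partition(a, b))
# [(2, ('a2', 'b2')), (1, ('a1', 'b1')), (3, ('a3', 'b3'))]
```

No data ever moves between partitions here, which is exactly the behaviour the question is asking Spark to provide.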

Answer

repartitioning only optimises the join but what I need is join WITHOUT A SHUFFLE

This is not true. Repartitioning does not only "optimize" the join. Repartitioning binds a Partitioner to your RDD, which is the key component for a map-side join.

I have made sure that the keys on which the join operation has to happen are distributed within the same partition

Spark must know about this. Build your DataFrames with the appropriate APIs so that they have the same Partitioner, and Spark will take care of the rest.
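The reason a shared Partitioner is enough can be sketched in plain Python (this simulates the placement logic, not Spark itself): when both sides are placed by the same hash partitioner, every key's rows already sit at the same partition index, so a purely local per-partition join is equivalent to the full join.

```python
# Sketch: why a shared partitioner lets the engine skip the shuffle.
# Both sides are placed with the same hash partitioner, so matching
# keys are guaranteed to land at the same partition index.

NUM_PARTITIONS = 4

def hash_partition(pairs, n=NUM_PARTITIONS):
    """Place (key, value) pairs into n partitions by hash(key) % n."""
    parts = [[] for _ in range(n)]
    for k, v in pairs:
        parts[hash(k) % n].append((k, v))
    return parts

left  = [(10, "L10"), (11, "L11"), (12, "L12")]
right = [(10, "R10"), (12, "R12"), (13, "R13")]

lp, rp = hash_partition(left), hash_partition(right)

# Join each partition pair locally -- no cross-partition traffic needed.
joined = []
for pl, pr in zip(lp, rp):
    rv = dict(pr)
    joined += [(k, (v, rv[k])) for k, v in pl if k in rv]

print(sorted(joined))
# [(10, ('L10', 'R10')), (12, ('L12', 'R12'))]
```

If the two sides had been partitioned with *different* partitioners (or if Spark did not know the partitioner), this guarantee would not hold, and a shuffle would be required to co-locate the keys first.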

