How to properly apply HashPartitioner before a join in Spark?


Question


To reduce shuffling during the joining of two RDDs, I decided to partition them using HashPartitioner first. Here is how I do it. Am I doing it correctly, or is there a better way to do this?

import org.apache.spark.HashPartitioner

val rddA = ...
val rddB = ...

// Use the same number of partitions on both sides so the RDDs are co-partitioned.
val numOfPartitions = rddA.getNumPartitions

val rddApartitioned = rddA.partitionBy(new HashPartitioner(numOfPartitions))
val rddBpartitioned = rddB.partitionBy(new HashPartitioner(numOfPartitions))

// Both sides share the same partitioner, so the join itself adds no extra shuffle.
val rddAB = rddApartitioned.join(rddBpartitioned)

Answer


To reduce shuffling during the joining of two RDDs,


It is a surprisingly common misconception that repartitioning reduces or even eliminates shuffles. It doesn't. Repartitioning is a shuffle in its purest form. It doesn't save time, bandwidth, or memory.


The rationale behind using a proactive partitioner is different - it allows you to shuffle once and then reuse that state to perform multiple by-key operations without additional shuffles (though, as far as I am aware, not necessarily without additional network traffic, as co-partitioning doesn't imply co-location, except where the shuffles occur within a single action).
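A minimal sketch of that reuse pattern (hypothetical `users`/`orders` data and a local SparkContext are assumptions for illustration; this is not the asker's actual workload):

```scala
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

val sc = new SparkContext(
  new SparkConf().setAppName("partitioner-reuse").setMaster("local[*]"))

val partitioner = new HashPartitioner(4)

// Shuffle once, then cache so the partitioned layout is kept around.
val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
  .partitionBy(partitioner)
  .cache()

val orders = sc.parallelize(Seq((1, 9.99), (1, 5.00), (2, 3.50)))
  .partitionBy(partitioner)

// Each of these by-key operations reuses the existing partitioning of
// `users` instead of shuffling it again.
val joined  = users.join(orders)
val grouped = users.groupByKey()
```

The payoff comes from `users` being reused across several by-key operations: it is shuffled once by `partitionBy` and never again.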


So your code is correct, but if you join only once, it doesn't buy you anything.
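To make that concrete, the two variants below incur comparable shuffle cost for a single join (a sketch against the asker's `rddA`/`rddB`, not a benchmark):

```scala
import org.apache.spark.HashPartitioner

// Variant 1: plain join. Spark shuffles both sides once as part of the join.
val joined1 = rddA.join(rddB)

// Variant 2: pre-partition, then join. Here the partitionBy calls themselves
// are the shuffles; the join is then shuffle-free, so the total amount of
// data moved is roughly the same as in variant 1.
val p = new HashPartitioner(rddA.getNumPartitions)
val joined2 = rddA.partitionBy(p).join(rddB.partitionBy(p))
```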

