Spark - repartition() vs coalesce()


Question

According to Learning Spark:

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.

One difference I understand is that with repartition() the number of partitions can be increased or decreased, but with coalesce() the number of partitions can only be decreased.

If the partitions are spread across multiple machines and coalesce() is run, how can it avoid data movement?

Answer

It avoids a full shuffle. If the number of partitions is known to be decreasing, then the executor can safely keep data on the minimum number of partitions, only moving data off the extra nodes onto the nodes that we keep.

So, it would go something like this:

Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12

Then coalesce down to 2 partitions:

Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)

Notice that Node 1 and Node 3 did not require their original data to move.
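The merge above can be sketched in plain Python (no Spark required, and this is a simplified illustration rather than Spark's actual partition coalescer, which also considers data locality): surviving partitions keep their data in place and simply absorb the data from the retired partitions.

```python
def coalesce_partitions(partitions, num_target):
    """Merge a list of partitions (lists) down to num_target partitions.

    The first num_target partitions stay where they are; the data of the
    remaining partitions is appended onto them round-robin, so only the
    "extra" partitions' data moves -- no full shuffle of every record.
    """
    if num_target >= len(partitions):
        return partitions  # coalesce never increases the partition count
    kept = [list(p) for p in partitions[:num_target]]
    for i, extra in enumerate(partitions[num_target:]):
        kept[i % num_target].extend(extra)  # move only the extra data
    return kept


# The four nodes from the example above, coalesced down to 2 partitions:
nodes = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
print(coalesce_partitions(nodes, 2))
# [[1, 2, 3, 7, 8, 9], [4, 5, 6, 10, 11, 12]]
```

Which kept partitions absorb which retired ones differs from the node pairing shown above, but the principle is the same: the data already on the surviving partitions never moves.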
