Spark - repartition() vs coalesce()

Question

According to Learning Spark:

Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.

One difference I get is that with repartition() the number of partitions can be increased/decreased, but with coalesce() the number of partitions can only be decreased.

If the partitions are spread across multiple machines and coalesce() is run, how can it avoid data movement?

Recommended answer

It avoids a full shuffle. If it's known that the number is decreasing then the executor can safely keep data on the minimum number of partitions, only moving the data off the extra nodes, onto the nodes that we kept.

So, it would go something like this:

Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12

Then coalesce down to 2 partitions:

Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)

Notice that Node 1 and Node 3 did not require their original data to move.
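To make the difference concrete, here is a toy model of the four-node example in plain Python. It is only a sketch and not Spark's actual implementation: the function names `coalesce` and `full_shuffle`, the `merge_into` mapping, and the modulo-based placement rule are all assumptions made for illustration. It counts how many records leave their original node under each strategy.

```python
from typing import Dict, List, Tuple

Partitions = Dict[str, List[int]]


def coalesce(nodes: Partitions, merge_into: Dict[str, str]) -> Tuple[Partitions, int]:
    """Toy coalesce: records on dropped nodes are appended to a kept node.

    merge_into maps each dropped node to the kept node that absorbs it
    (a hypothetical parameter, chosen here to mirror the example).
    """
    # Kept nodes hold on to their own records; nothing is moved for them.
    result = {name: list(records) for name, records in nodes.items()
              if name not in merge_into}
    moved = 0
    for dropped, target in merge_into.items():
        result[target].extend(nodes[dropped])
        moved += len(nodes[dropped])
    return result, moved


def full_shuffle(nodes: Partitions, targets: List[str]) -> Tuple[Partitions, int]:
    """Toy full shuffle: every record is re-placed by a simple modulo rule,
    regardless of which node it currently sits on."""
    result: Partitions = {name: [] for name in targets}
    moved = 0
    for source, records in nodes.items():
        for record in records:
            target = targets[record % len(targets)]
            result[target].append(record)
            if target != source:
                moved += 1
    return result, moved


nodes = {
    "Node 1": [1, 2, 3],
    "Node 2": [4, 5, 6],
    "Node 3": [7, 8, 9],
    "Node 4": [10, 11, 12],
}

# Same merge as in the example: Node 4 folds into Node 1, Node 2 into Node 3.
coalesced, coalesce_moved = coalesce(nodes, {"Node 4": "Node 1", "Node 2": "Node 3"})
shuffled, shuffle_moved = full_shuffle(nodes, ["Node 1", "Node 3"])

print(coalesced, coalesce_moved)
print(shuffled, shuffle_moved)
```

In this toy run, the coalesce-style merge moves only the 6 records that lived on the dropped nodes, while the shuffle-style redistribution moves 9 of the 12 records, including some that started on the nodes we kept.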
