Spark Coalesce More Partitions


Question

I have a Spark job that processes a large amount of data and writes the results to S3. During processing I might have in excess of 5000 partitions. Before I write to S3, I want to reduce the number of partitions, since each partition is written out as a file.

In some other cases I may only have 50 partitions during processing. If I wanted to coalesce rather than repartition for performance reasons, what would happen?

The docs say coalesce should only be used when the number of output partitions is less than the number of input partitions, but what happens if it isn't? It doesn't seem to cause an error. Does it make the data incorrect or cause performance problems?

I am trying to avoid having to count the partitions of my RDD to determine whether I have more partitions than my output limit and, if so, coalesce.
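For reference, a minimal sketch of the conditional coalesce being described, assuming Scala Spark; `outputLimit`, the partition counts, and the S3 path are all placeholders, not values from the original question:

```scala
import org.apache.spark.sql.SparkSession

object ConditionalCoalesce {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("conditional-coalesce").master("local[*]").getOrCreate()

    // Stand-in for the processed data: 5000 partitions, as in the question.
    val result = spark.sparkContext.parallelize(1 to 100000, numSlices = 5000)

    val outputLimit = 500 // assumed cap on the number of output files

    // Only coalesce when the current partition count exceeds the limit.
    // getNumPartitions is a cheap metadata lookup and does not trigger a job.
    val toWrite =
      if (result.getNumPartitions > outputLimit) result.coalesce(outputLimit)
      else result

    toWrite.saveAsTextFile("s3a://example-bucket/output") // placeholder S3 path
    spark.stop()
  }
}
```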

Answer

With the default PartitionCoalescer, if the requested number of partitions is larger than the current number of partitions and you don't set shuffle to true, then the number of partitions stays unchanged.

If you do set shuffle to true, coalesce is equivalent to repartition with the same numPartitions.
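A minimal sketch of this behaviour (Scala, local mode, illustrative partition counts): asking coalesce for more partitions without a shuffle leaves the count unchanged, while shuffle = true behaves like repartition.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceMorePartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coalesce-more-partitions").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 1000, numSlices = 50) // start with 50 partitions

    // Asking for more partitions without a shuffle is a no-op on the partition count.
    println(rdd.coalesce(200).getNumPartitions)                 // 50

    // With shuffle = true, coalesce behaves like repartition(200).
    println(rdd.coalesce(200, shuffle = true).getNumPartitions) // 200
    println(rdd.repartition(200).getNumPartitions)              // 200

    spark.stop()
  }
}
```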
