Spark Coalesce More Partitions


Question

I have a Spark job that processes a large amount of data and writes the results to S3. During processing I might have in excess of 5000 partitions. Before I write to S3, I want to reduce the number of partitions, since each partition is written out as a file.

In some other cases I may only have 50 partitions during processing. If I wanted to coalesce rather than repartition for performance reasons, what would happen?

The docs say coalesce should only be used when the number of output partitions is less than the number of input partitions, but what happens if it isn't? It doesn't seem to cause an error. Does it make the data incorrect or cause performance problems?

I am trying to avoid having to count the partitions of my RDD to determine whether I have more partitions than my output limit and, if so, coalesce.
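For reference, a minimal sketch of the conditional coalesce being described, assuming Scala Spark; `outputLimit`, the partition counts, and the S3 path are all placeholders, not values from the original question:

```scala
import org.apache.spark.sql.SparkSession

object ConditionalCoalesce {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("conditional-coalesce").master("local[*]").getOrCreate()

    // Stand-in for the processed data: 5000 partitions, as in the question.
    val result = spark.sparkContext.parallelize(1 to 100000, numSlices = 5000)

    val outputLimit = 500 // assumed cap on the number of output files

    // Only coalesce when the current partition count exceeds the limit.
    // getNumPartitions is a cheap metadata lookup and does not trigger a job.
    val toWrite =
      if (result.getNumPartitions > outputLimit) result.coalesce(outputLimit)
      else result

    toWrite.saveAsTextFile("s3a://example-bucket/output") // placeholder S3 path
    spark.stop()
  }
}
```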

Answer

With the default PartitionCoalescer, if the requested number of partitions is larger than the current number of partitions and you don't set shuffle to true, then the number of partitions stays unchanged.

If you do set shuffle to true, coalesce is equivalent to repartition with the same numPartitions.
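A minimal sketch of this behaviour (Scala, local mode, illustrative partition counts): asking coalesce for more partitions without a shuffle leaves the count unchanged, while shuffle = true behaves like repartition.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceMorePartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coalesce-more-partitions").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 1000, numSlices = 50) // start with 50 partitions

    // Asking for more partitions without a shuffle is a no-op on the partition count.
    println(rdd.coalesce(200).getNumPartitions)                 // 50

    // With shuffle = true, coalesce behaves like repartition(200).
    println(rdd.coalesce(200, shuffle = true).getNumPartitions) // 200
    println(rdd.repartition(200).getNumPartitions)              // 200

    spark.stop()
  }
}
```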
