Spark: coalesce is very slow even though the output data is very small

Published: 2020/9/4 1:12:17 · Tags: scala apache-spark coalesce

Question

I have the following code in Spark:

myData.filter(t => t.getMyEnum() == null)
      .map(t => t.toString)
      .saveAsTextFile("myOutput")

The myOutput folder ends up with 2000+ files, but t.getMyEnum() == null holds for only a few records, so there are very few output records. Because I don't want to hunt for a few records across 2000+ output files, I tried to combine the output using coalesce, like this:

myData.filter(t => t.getMyEnum() == null)
      .map(t => t.toString)
      .coalesce(1, false)
      .saveAsTextFile("myOutput")

Then the job became EXTREMELY SLOW! Why is it so slow, when there are only a few output records scattered across 2000+ partitions? Is there a better way to solve this problem?

Recommended answer

If you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you would like (e.g. one node in the case of numPartitions = 1). Without a shuffle, coalesce is a narrow transformation, so the upstream filter and map are pulled into the same single task, which then has to work through all 2000+ input partitions by itself. To avoid this, you can pass shuffle = true. This adds a shuffle step, but it means the upstream partitions are still processed in parallel (per whatever the current partitioning is), and only the few surviving records are moved to the single output partition.
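The difference can be sketched as follows. This is a minimal illustration, not the asker's actual job: it assumes spark-core is on the classpath and a local master, and the names CoalesceDemo, filtered and partitionCounts, as well as the integer data standing in for myData, are hypothetical. Both variants end with a single output partition; what differs is where the filter and map run, which shows up as a shuffle boundary in the printed lineage.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CoalesceDemo {
  // Hypothetical stand-in for myData: integers spread over 2000 partitions,
  // of which only four records survive the filter.
  def filtered(sc: SparkContext): RDD[String] =
    sc.parallelize(1 to 200000, numSlices = 2000)
      .filter(_ % 50000 == 0)
      .map(_.toString)

  // Number of output partitions without and with a shuffle.
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    val narrow   = filtered(sc).coalesce(1, shuffle = false)
    val shuffled = filtered(sc).coalesce(1, shuffle = true)
    (narrow.getNumPartitions, shuffled.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("coalesce-demo").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      println(partitionCounts(sc))
      // Lineage without a shuffle: filter and map sit in the same stage as the
      // coalesce, so one task evaluates them over every input partition.
      println(filtered(sc).coalesce(1, shuffle = false).toDebugString)
      // Lineage with a shuffle: filter and map stay in their own stage with
      // one task per input partition, ahead of the single-partition stage.
      println(filtered(sc).coalesce(1, shuffle = true).toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

Note that repartition(1) is shorthand for coalesce(1, shuffle = true), so either spelling gives the shuffled variant.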

Note: With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling coalesce(1000, shuffle = true) will result in 1000 partitions, with the data distributed using a hash partitioner.
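A small sketch of this note, under the same assumptions as any local experiment (spark-core on the classpath, a local master); GrowPartitionsDemo, partitionCounts and the sample data are hypothetical names chosen for illustration. Without a shuffle, asking for more partitions than the RDD already has leaves the count unchanged, whereas shuffle = true actually produces the larger number.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GrowPartitionsDemo {
  // Returns (partition count after coalesce without shuffle,
  //          partition count after coalesce with shuffle).
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    // Hypothetical data set that starts out with 100 partitions.
    val small = sc.parallelize(1 to 10000, numSlices = 100)
    val unchanged = small.coalesce(1000, shuffle = false).getNumPartitions
    val grown     = small.coalesce(1000, shuffle = true).getNumPartitions
    (unchanged, grown)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("grow-partitions-demo").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      println(partitionCounts(sc)) // prints (100,1000)
    } finally {
      sc.stop()
    }
  }
}
```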

So try passing shuffle = true to the coalesce function, i.e.:

myData.filter(_.getMyEnum == null)
      .map(_.toString)
      .coalesce(1, shuffle = true)
      .saveAsTextFile("myOutput")
