Shuffled vs non-shuffled coalesce in Apache Spark


Question

What is the difference between the following transformations when they are executed right before writing an RDD to a file?

  1. coalesce(1, shuffle = true)
  2. coalesce(1, shuffle = false)

Code example:

val input = sc.textFile(inputFile)
val filtered = input.filter(doSomeFiltering)
val mapped = filtered.map(doSomeMapping)

mapped.coalesce(1, shuffle = true).saveAsTextFile(outputFile)
vs
mapped.coalesce(1, shuffle = false).saveAsTextFile(outputFile)

And how does it compare with collect()? I'm fully aware that Spark save methods will store it with an HDFS-style structure; however, I'm more interested in the data partitioning aspects of collect() and shuffled/non-shuffled coalesce().

Answer

shuffle = true and shuffle = false aren't going to make any practical difference in the resulting output, since both go down to a single partition. However, when you set it to true you will do a shuffle, which isn't of any use here. With shuffle = true the output is evenly distributed amongst the partitions (and you're also able to increase the number of partitions if you want), but since your target is 1 partition, everything ends up in one partition regardless.
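A quick way to see the extra shuffle that shuffle = true introduces is to inspect the RDD lineage. This is a minimal sketch, assuming a spark-shell session (so sc is in scope); the input path is hypothetical.

val input = sc.textFile("hdfs:///tmp/input.txt") // hypothetical path

// shuffle = false: a narrow dependency; upstream partitions are merged
// in place, so the whole job runs as a single stage.
val noShuffle = input.coalesce(1, shuffle = false)

// shuffle = true: inserts a full shuffle (an extra stage) before
// collapsing to one partition.
val withShuffle = input.coalesce(1, shuffle = true)

// The lineage makes the difference visible: only the second plan
// contains a ShuffledRDD.
println(noShuffle.toDebugString)
println(withShuffle.toDebugString)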

As for the comparison with collect(), the difference is where the data ends up: coalesce(1) stores all of the data on a single executor, whereas collect() brings it all back to the driver.
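As a rough sketch of that difference (spark-shell again, hypothetical paths): saveAsTextFile after coalesce(1) is written out by the executor holding the single partition, while collect() pulls every row across the network into the driver.

val mapped = sc.textFile("hdfs:///tmp/input.txt").map(_.trim) // hypothetical input

// coalesce(1): the single merged partition lives on ONE executor,
// which writes the lone part-00000 file to the output directory.
mapped.coalesce(1, shuffle = false).saveAsTextFile("hdfs:///tmp/output")

// collect(): every row is shipped to the DRIVER and materialized as a
// local Array[String]; nothing is written to storage unless you do it
// yourself, and the whole dataset must fit in driver memory.
val local: Array[String] = mapped.collect()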

