Why does the sortBy transformation trigger a Spark job?
Question
As per the Spark documentation, only RDD actions can trigger a Spark job; transformations are lazily evaluated and only run when an action is called on them.

Yet I see that the sortBy transformation is applied immediately, and it shows up as a job in the Spark UI. Why?
Answer
sortBy is implemented using sortByKey, which depends on a RangePartitioner (JVM) or a partitioning function (Python). When you call sortBy / sortByKey, the partitioner (partitioning function) is initialized eagerly and samples the input RDD to compute partition boundaries. The job you see in the Spark UI corresponds to this sampling step.
The actual sorting is performed only when you execute an action on the newly created RDD or its descendants.