Why does sortBy transformation trigger a Spark job?
Question
As per the Spark documentation, only RDD actions can trigger a Spark job; transformations are lazily evaluated until an action is called on them.

Yet I see that the sortBy transformation is applied immediately and shows up as a job in the Spark UI. Why?
Answer
sortBy is implemented using sortByKey, which depends on a RangePartitioner (JVM) or a partitioning function (Python). When you call sortBy / sortByKey, the partitioner (partitioning function) is initialized eagerly and samples the input RDD to compute partition boundaries. The job you see corresponds to this process.
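To make the sampling step concrete, here is a minimal pure-Python sketch of the idea behind range partitioning: draw a sample of the keys, sort it, and pick evenly spaced boundary keys that later route each key to a partition. The function names (compute_boundaries, get_partition) and the fixed sample size are illustrative assumptions, not Spark's actual internals — Spark's RangePartitioner uses weighted reservoir sampling and handles skew, but the shape of the computation is the same.

```python
import random
from bisect import bisect_left

def compute_boundaries(keys, num_partitions, sample_size=20, seed=42):
    """Sample the keys and derive range-partition boundaries.

    This mimics (in simplified form) the eager job that sortBy runs:
    the input must be scanned to sample it, which is why a job appears
    in the Spark UI before any action is called.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Pick num_partitions - 1 evenly spaced boundary keys from the sorted sample.
    step = len(sample) / num_partitions
    return [sample[int(step * (i + 1))] for i in range(num_partitions - 1)]

def get_partition(key, boundaries):
    """Route a key to a partition via binary search over the boundaries."""
    return bisect_left(boundaries, key)

boundaries = compute_boundaries(list(range(100)), num_partitions=4)
print(boundaries)                      # three boundary keys for four partitions
print(get_partition(-1, boundaries))   # smallest keys land in partition 0
print(get_partition(1000, boundaries)) # largest keys land in the last partition
```

Once the boundaries exist, every partition holds a contiguous key range, so sorting each partition locally yields a globally sorted RDD.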
The actual sorting is performed only when you execute an action on the newly created RDD or its descendants.