How does the Apache Spark scheduler split files into tasks?
Question
At Spark Summit 2014, Aaron gave the talk A Deeper Understanding of Spark Internals; page 17 of his slides shows a stage being split into 4 tasks, as below:
Here I want to know three things about how a stage is split into tasks:
1. In the example above, it seems that the number of tasks is based on the number of files. Am I right?
2. If I'm right in point 1, and there were just 3 files under the directory, would it create just 3 tasks?
3. If I'm right in point 2, what if there is just one very large file? Would this stage be split into just 1 task? And what if the data comes from a streaming data source?
Thanks a lot; I'm confused about how a stage gets split into tasks.
Answer
You can configure the number of partitions (splits) for the entire process as the second parameter to a job, e.g. for parallelize if we want 3 partitions:
a = sc.parallelize(myCollection, 3)
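The way parallelize divides a local collection into roughly even partitions can be sketched in plain Python (no Spark install needed; `slice_collection` is a hypothetical helper that mimics the integer-boundary slicing Spark uses internally, not Spark's actual API):

```python
def slice_collection(seq, num_slices):
    """Split seq into num_slices contiguous, roughly even chunks.

    Boundary for chunk i is (i * len(seq)) // num_slices, so chunk
    sizes differ by at most one element.
    """
    n = len(seq)
    return [seq[(i * n) // num_slices: ((i + 1) * n) // num_slices]
            for i in range(num_slices)]

# 10 elements into 3 partitions: sizes 3, 3, 4
parts = slice_collection(list(range(10)), 3)
print(parts)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

Each of those chunks would become one partition, and hence one task per stage.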
Spark will divide the work into relatively even sizes (*). Large files will be broken down accordingly - you can check the actual number of partitions with:
rdd.partitions.size
So no, you will not end up with a single worker chugging away for a long time on a single file.
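For files read from HDFS, the breakdown follows the Hadoop FileInputFormat split computation: split size is `max(minSize, min(maxSize, blockSize))`, and each split becomes one task. A simplified sketch (the real code also lets the last split grow up to 1.1x the split size; that refinement is ignored here):

```python
import math

def num_splits(file_size, block_size=128 * 1024 * 1024,
               min_size=1, max_size=float("inf")):
    """Approximate the number of input splits (= tasks) for one file.

    split_size = max(min_size, min(max_size, block_size)); the file
    is then cut into ceil(file_size / split_size) pieces.
    """
    split_size = max(min_size, min(max_size, block_size))
    return max(1, math.ceil(file_size / split_size))

# A 1 GB file with the default 128 MB block size -> 8 tasks
print(num_splits(1024 * 1024 * 1024))  # 8
```

This is why one very large file does not become one task: it becomes roughly `file_size / block_size` tasks.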
(*) If you have very small files, then that may change this processing. But in any case, large files will follow this pattern.
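The small-file caveat follows from the fact that an input split never spans file boundaries: every file contributes at least one split, so a directory of tiny files yields one (tiny) task per file rather than a few evenly sized ones. A self-contained sketch of that counting rule (simplified: no min/max split-size overrides, no last-split slack):

```python
import math

def total_tasks(file_sizes, block_size=128 * 1024 * 1024):
    """Splits never span files: each file yields at least one split."""
    return sum(max(1, math.ceil(s / block_size)) for s in file_sizes)

# 1000 files of 1 KB each -> 1000 tiny tasks, not one even task
print(total_tasks([1024] * 1000))  # 1000
```

This is the classic "small files problem": scheduling overhead per task dominates when each task reads only a few kilobytes.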