How does the Apache Spark scheduler split files into tasks?


Question

At Spark Summit 2014, Aaron gave the talk A Deeper Understanding of Spark Internals. Page 17 of his slides shows a stage being split into 4 tasks, as below:

Here I want to know three things about how a stage gets split into tasks:

  1. In the example above, it seems that the number of tasks is based on the number of files. Am I right?

  2. If I'm right in point 1, and there are just 3 files under the directory, will it create just 3 tasks?

  3. If I'm right in point 2, what if there is just one, but very large, file? Will this stage be split into just 1 task? And what about data coming from a streaming data source?

Thanks a lot; I am confused about how a stage gets split into tasks.

Answer

You can configure the number of partitions (splits) for the entire process as the second parameter to a job, e.g. for parallelize if we want 3 partitions:

// request 3 partitions explicitly when creating the RDD
val a = sc.parallelize(myCollection, 3)

Spark will divide the work into relatively even sizes (*). Large files will be broken down accordingly - you can check the actual number of partitions with:

// number of partitions (and hence tasks per stage) for this RDD
rdd.partitions.size
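
For example, here is a minimal sketch (assuming a running SparkContext sc and a hypothetical large file at /data/big.txt) showing that a single large file is still read as multiple partitions, and that textFile's minPartitions argument can request more splits:

// a single large file is read as multiple partitions, one task per partition
val big = sc.textFile("/data/big.txt")      // hypothetical path, for illustration only
println(big.partitions.size)                // typically one partition per input split (HDFS block)

// ask for a lower bound on the number of splits explicitly
val big8 = sc.textFile("/data/big.txt", 8)  // minPartitions = 8
println(big8.partitions.size)               // at least 8 for a sufficiently large file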

So no, you will not end up with a single Worker chugging away for a long time on a single file.

(*) If you have very small files, that may change this processing. But in any case, large files will follow this pattern.
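
As a related sketch (assuming a hypothetical directory /data/logs containing three small text files), reading a directory with textFile gives at least one partition per file, so in the small-file case the task count is driven by the file count rather than by even splitting:

// reading a directory: each small file contributes at least one partition
val logs = sc.textFile("/data/logs")   // hypothetical directory holding 3 small files
println(logs.partitions.size)          // typically 3: one partition (and one task) per file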
