How does the Apache Spark scheduler split files into tasks?
Question
At Spark Summit 2014, Aaron gave the talk A Deeper Understanding of Spark Internals; page 17 of his slides shows a stage being split into 4 tasks, as below:
Here I want to know three things about how a stage is split into tasks:
1. In the example above, it seems that the number of tasks is based on the number of files. Am I right?
2. If I'm right in point 1, and there were just 3 files under the directory, would it create just 3 tasks?
3. If I'm right in point 2, what if there is just one very large file? Would this stage be split into just 1 task? And what if the data comes from a streaming data source?
Thanks a lot; I'm confused about how a stage gets split into tasks.
Answer
You can configure the number of partitions (splits) for the entire process as the second parameter to a job, e.g. for parallelize if we want 3 partitions:
a = sc.parallelize(myCollection, 3)
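The way parallelize divides a local collection into roughly even partitions can be sketched in plain Python (no Spark install needed; `slice_collection` is a hypothetical helper that mimics the integer-boundary slicing Spark uses internally, not Spark's actual API):

```python
def slice_collection(seq, num_slices):
    """Split seq into num_slices contiguous, roughly even chunks.

    Boundary for chunk i is (i * len(seq)) // num_slices, so chunk
    sizes differ by at most one element.
    """
    n = len(seq)
    return [seq[(i * n) // num_slices: ((i + 1) * n) // num_slices]
            for i in range(num_slices)]

# 10 elements into 3 partitions: sizes 3, 3, 4
parts = slice_collection(list(range(10)), 3)
print(parts)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

Each of those chunks would become one partition, and hence one task per stage.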
Spark will divide the work into relatively even sizes (*). Large files will be broken down accordingly - you can check the actual number of partitions with:
rdd.partitions.size
So no, you will not end up with a single worker chugging away for a long time on a single file.
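For files read from HDFS, the breakdown follows the Hadoop FileInputFormat split computation: split size is `max(minSize, min(maxSize, blockSize))`, and each split becomes one task. A simplified sketch (the real code also lets the last split grow up to 1.1x the split size; that refinement is ignored here):

```python
import math

def num_splits(file_size, block_size=128 * 1024 * 1024,
               min_size=1, max_size=float("inf")):
    """Approximate the number of input splits (= tasks) for one file.

    split_size = max(min_size, min(max_size, block_size)); the file
    is then cut into ceil(file_size / split_size) pieces.
    """
    split_size = max(min_size, min(max_size, block_size))
    return max(1, math.ceil(file_size / split_size))

# A 1 GB file with the default 128 MB block size -> 8 tasks
print(num_splits(1024 * 1024 * 1024))  # 8
```

This is why one very large file does not become one task: it becomes roughly `file_size / block_size` tasks.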
(*) If you have very small files, then that may change this processing. But in any case, large files will follow this pattern.
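The small-file caveat follows from the fact that an input split never spans file boundaries: every file contributes at least one split, so a directory of tiny files yields one (tiny) task per file rather than a few evenly sized ones. A self-contained sketch of that counting rule (simplified: no min/max split-size overrides, no last-split slack):

```python
import math

def total_tasks(file_sizes, block_size=128 * 1024 * 1024):
    """Splits never span files: each file yields at least one split."""
    return sum(max(1, math.ceil(s / block_size)) for s in file_sizes)

# 1000 files of 1 KB each -> 1000 tiny tasks, not one even task
print(total_tasks([1024] * 1000))  # 1000
```

This is the classic "small files problem": scheduling overhead per task dominates when each task reads only a few kilobytes.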