Internal Work of Spark

Problem Description

Nowadays, Spark is on the rise. Spark uses the Scala language to load and execute programs, and also Python and Java. An RDD is used to store the data. But I can't understand the architecture of Spark and how it runs internally.

Could you please explain the Spark architecture, as well as how it works internally?

Recommended Answer

I have also been searching the web to learn about the internals of Spark. Below is what I could learn, and I thought of sharing it here.

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
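
As a minimal sketch of that difference (assuming a spark-shell session where sc is the usual SparkContext), transformations are lazy and only an action actually runs a job:

val numbers = sc.parallelize(1 to 10)     // create an RDD from a local collection
val doubled = numbers.map(_ * 2)          // transformation: new RDD, nothing runs yet
val evens   = doubled.filter(_ % 4 == 0)  // another transformation, still lazy

val result  = evens.collect()             // action: runs the job and returns
                                          // Array(4, 8, 12, 16, 20) to the driver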

Spark translates the RDD transformations into a DAG (Directed Acyclic Graph) and starts the execution.

At a high level, when any action is called on the RDD, Spark creates the DAG and submits it to the DAG scheduler.


  • The DAG scheduler divides the operators into stages of tasks. A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together; for example, many map operators can be scheduled in a single stage. The final result of the DAG scheduler is a set of stages.

The stages are passed on to the task scheduler. The task scheduler launches the tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about the dependencies between the stages.

The worker executes the tasks on the slave.
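
As a small sketch of how the cluster manager is chosen (in the shell, sc already exists; in a standalone application you create it yourself, and the master URL picks the cluster manager; the app name below is made up):

import org.apache.spark.{SparkConf, SparkContext}

// The master URL picks the cluster manager that will launch the tasks:
//   "local[4]"          - run locally with 4 threads, no cluster manager
//   "spark://host:7077" - Spark Standalone master
//   "yarn"              - YARN (configuration comes from the Hadoop config)
//   "mesos://host:5050" - Mesos master
val conf = new SparkConf()
  .setAppName("log-level-count")  // hypothetical app name
  .setMaster("local[4]")          // swap in a cluster URL for a real deployment
val sc = new SparkContext(conf)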

Let's now come to how Spark builds the DAG.

At a high level, there are two kinds of transformations that can be applied onto RDDs, namely narrow transformations and wide transformations. Wide transformations basically result in stage boundaries.

Narrow transformation - doesn't require the data to be shuffled across the partitions; for example, map, filter, etc.

Wide transformation - requires the data to be shuffled; for example, reduceByKey, etc.
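
A small sketch of the difference, using a made-up pair RDD:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// Narrow: each output partition depends on exactly one input partition,
// so no data moves across the network.
val mapped = pairs.map { case (k, v) => (k, v * 10) }

// Wide: values for the same key can sit in different partitions,
// so reduceByKey shuffles the data - this is where a stage boundary falls.
val summed = mapped.reduceByKey(_ + _)    // ("a", 20), ("b", 10)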

Let's take an example of counting how many log messages appear at each level of severity.

The following is the log file, where each line starts with its severity level:

INFO I'm Info message
WARN I'm a Warn message
INFO I'm another Info message

and the following Scala code extracts the counts:

val input = sc.textFile("log.txt")
val splitedLines = input.map(line => line.split(" "))
                        .map(words => (words(0), 1))
                        .reduceByKey{(a,b) => a + b}

This sequence of commands implicitly defines a DAG of RDD objects (the RDD lineage) that will be used later when an action is called. Each RDD maintains a pointer to one or more parents, along with metadata about what type of relationship it has with the parent. For example, when we call val b = a.map() on an RDD, the RDD b keeps a reference to its parent a; that's the lineage.
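
That parent pointer can be inspected directly; dependencies and toDebugString are part of the public RDD API, though the exact printed form varies by Spark version:

val a = sc.parallelize(1 to 5)
val b = a.map(_ + 1)        // b records a as its parent

// map creates a narrow, one-to-one dependency on a
println(b.dependencies)     // e.g. List(org.apache.spark.OneToOneDependency@...)
println(b.toDebugString)    // prints the full lineage back to a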

To display the lineage of an RDD, Spark provides the toDebugString() method. For example, executing toDebugString() on the splitedLines RDD will output the following:

(2) ShuffledRDD[6] at reduceByKey at <console>:25 []
    +-(2) MapPartitionsRDD[5] at map at <console>:24 []
    |  MapPartitionsRDD[4] at map at <console>:23 []
    |  log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
    |  log.txt HadoopRDD[0] at textFile at <console>:21 []

The first line (from the bottom) shows the input RDD, which we created by calling sc.textFile(). In this output, the (2) at the start of a line is the number of partitions, and the indented +- marks the shuffle dependency that becomes a stage boundary.

Once the DAG is built, the Spark scheduler creates a physical execution plan. As mentioned above, the DAG scheduler splits the graph into multiple stages, and the stages are created based on the transformations. The narrow transformations will be grouped (pipelined) together into a single stage. So for our example, Spark will create a two-stage execution.
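
Annotating the earlier pipeline with where that boundary falls (the stage comments are mine, not Spark output):

val input = sc.textFile("log.txt")          // Stage 1: read the partitions
val splitedLines = input
  .map(line => line.split(" "))             // Stage 1: narrow, pipelined
  .map(words => (words(0), 1))              // Stage 1: narrow, pipelined
  // ---- shuffle here: stage boundary ----
  .reduceByKey { (a, b) => a + b }          // Stage 2: wide, after the shuffle

An action such as splitedLines.collect() would then submit both stages as a single job.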

The DAG scheduler then submits the stages to the task scheduler. The number of tasks submitted depends on the number of partitions present in the textFile. For example, if we have 4 partitions in this example, then there will be 4 sets of tasks created and submitted in parallel, provided there are enough slaves/cores.
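
A sketch of steering that parallelism from the code (the second argument to textFile is the standard minPartitions hint):

// Ask for at least 4 input partitions; each partition of a stage becomes
// one task, so stage 1 runs as 4 tasks if 4 slaves/cores are free.
val input = sc.textFile("log.txt", 4)
println(input.partitions.length)   // typically 4 for this small file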

For more detailed information, I suggest you go through the following YouTube videos, where the Spark creators give in-depth details about the DAG, the execution plan, and the lifetime of a job.


  1. Advanced Apache Spark - Sameer Farooqui (Databricks)
  2. A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
  3. Introduction to AmpLab Spark Internals
