Where to learn how DAG works under the covers in RDD?


Problem description

The Spark research paper has prescribed a new distributed programming model over classic Hadoop MapReduce, claiming simplification and a vast performance boost in many cases, especially for Machine Learning. However, the material to uncover the internal mechanics of Resilient Distributed Datasets with the Directed Acyclic Graph seems to be lacking in this paper.

Would it be better to learn this by investigating the source code?

Answer

Even I have been looking around the web to learn how Spark computes the DAG from the RDD and subsequently executes the tasks.

At a high level, when any action is called on the RDD, Spark creates the DAG and submits it to the DAG scheduler.


  • The DAG scheduler divides operators into stages of tasks. A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together. For example, many map operators can be scheduled in a single stage. The final result of the DAG scheduler is a set of stages.

  • The stages are passed on to the Task Scheduler. The task scheduler launches the tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about the dependencies of the stages.

  • The Worker executes the tasks on the Slave.
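
To make that flow concrete, here is a minimal sketch of a small job in the spark-shell (assuming sc is the usual SparkContext; the names and numbers are only illustrative), with comments mapping each step to the components above:

// Transformations only build up the RDD graph; nothing runs yet.
val numbers = sc.parallelize(1 to 100, numSlices = 4) // 4 partitions
val doubled = numbers.map(_ * 2)                      // narrow, so it stays in the same stage

// The action triggers the chain:
// action -> DAG -> DAG scheduler (stages) -> task scheduler (tasks) -> workers
val total = doubled.reduce(_ + _)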

Let's come to how Spark builds the DAG.

At a high level, there are two kinds of transformations that can be applied to RDDs, namely narrow transformations and wide transformations. Wide transformations basically result in stage boundaries.

Narrow transformation - doesn't require the data to be shuffled across the partitions, for example map, filter, etc.

Wide transformation - requires the data to be shuffled, for example reduceByKey, etc.
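
A quick way to see this distinction for yourself is to inspect the dependencies of the resulting RDDs in the spark-shell. This is just a small sketch (assuming sc is available): a narrow transformation such as map reports a one-to-one dependency, while reduceByKey reports a shuffle dependency.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val mapped  = pairs.map { case (k, v) => (k, v * 2) } // narrow: no shuffle needed
val reduced = pairs.reduceByKey(_ + _)                // wide: introduces a shuffle boundary

println(mapped.dependencies)  // roughly: List(org.apache.spark.OneToOneDependency@...)
println(reduced.dependencies) // roughly: List(org.apache.spark.ShuffleDependency@...)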

Let's take the example of counting how many log messages appear at each level of severity.

Following is the log file, which starts with the severity level:

INFO I'm Info message
WARN I'm a Warn message
INFO I'm another Info message

and the following Scala code extracts the same:

// Load the log file as an RDD of lines.
val input = sc.textFile("log.txt")
// Split each line on spaces, emit (severityLevel, 1) pairs, then sum the counts per level.
val splitedLines = input.map(line => line.split(" "))
                        .map(words => (words(0), 1))
                        .reduceByKey{(a,b) => a + b}

This sequence of commands implicitly defines a DAG of RDD objects (the RDD lineage) that will be used later when an action is called. Each RDD maintains a pointer to one or more parents, along with metadata about what type of relationship it has with the parent. For example, when we call val b = a.map() on an RDD, the RDD b keeps a reference to its parent a; that is the lineage.
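
A minimal sketch of that parent pointer, again from the spark-shell (a and b here simply mirror the names in the sentence above):

val a = sc.parallelize(1 to 10)
val b = a.map(_ + 1)

// b records a narrow dependency that points back at its parent a
println(b.dependencies.head.rdd == a) // true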

To display the lineage of an RDD, Spark provides a debug method toDebugString(). For example, executing toDebugString() on the splitedLines RDD will output the following:

(2) ShuffledRDD[6] at reduceByKey at <console>:25 []
    +-(2) MapPartitionsRDD[5] at map at <console>:24 []
    |  MapPartitionsRDD[4] at map at <console>:23 []
    |  log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
    |  log.txt HadoopRDD[0] at textFile at <console>:21 []

The first line (from the bottom) shows the input RDD. We created this RDD by calling sc.textFile(). Below is a more diagrammatic view of the DAG graph created from the given RDD.

Once the DAG is built, the Spark scheduler creates a physical execution plan. As mentioned above, the DAG scheduler splits the graph into multiple stages, and the stages are created based on the transformations. The narrow transformations are grouped (pipelined) together into a single stage. So for our example, Spark will create a two-stage execution as follows:
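
Note that nothing has actually run so far, because the snippet above only applies transformations. Calling an action on splitedLines is what triggers those two stages; a minimal sketch, assuming the same spark-shell session as above:

// Stage 0: textFile + both map calls, pipelined together.
// Stage 1: reduceByKey, after the shuffle.
val counts = splitedLines.collect()
counts.foreach(println) // e.g. (INFO,2) and (WARN,1) for the sample log above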

The DAG scheduler will then submit the stages to the task scheduler. The number of tasks submitted depends on the number of partitions present in textFile. For example, if we have 4 partitions in this example, then there will be 4 sets of tasks created and submitted in parallel, provided there are enough slaves/cores. The diagram below illustrates this in more detail:
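
On the code side, you can check (and influence) the partition count, and therefore the number of tasks per stage, directly. A small sketch, where the 4 is only an illustrative value:

// Ask textFile for at least 4 partitions and verify how many we actually got.
val input4 = sc.textFile("log.txt", minPartitions = 4)
println(input4.partitions.length) // typically 4 for this small file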

For more detailed information, I suggest you go through the following YouTube videos, where the Spark creators give in-depth details about the DAG, the execution plan, and the lifetime.


  1. Advanced Apache Spark - Sameer Farooqui (Databricks)
  2. A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
  3. Introduction to AmpLab Spark Internals
