What is a Spark job?

Question

I have finished installing Spark and have run a few test cases with master and worker nodes set up. That said, I am quite confused about what exactly a job means in the context of Spark (not SparkContext). I have the following questions:


  • How is it different from a driver program?

  • Is the application itself part of the driver program?

  • Is a spark-submit, in a way, a job?

I read the Spark documentation, but this is still not clear to me.

That said, my plan is to write Spark jobs programmatically and then run them through spark-submit.

Kindly help with an example if possible. It would be very helpful.

Note: Kindly do not post Spark links, because I have already tried them. Even though the questions sound naive, I still need more clarity.

Recommended answer

Well, terminology can always be difficult since it depends on context. In many cases, you may be used to saying "submit a job to a cluster", which for Spark would mean submitting a driver program.

That said, Spark has its own definition of "job", taken directly from the glossary:

Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
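To make the glossary entry a little more concrete, here is a toy model in plain Python. It is not the Spark API: `ToyRDD` and its `jobs_run` counter are hypothetical names invented for this sketch. It only mimics the idea that transformations are recorded lazily and that an action is what spawns a job.

```python
# Toy model only: this is NOT the Spark API. It mimics the idea that
# transformations are lazy and that only an action spawns a job.

class ToyRDD:
    """Hypothetical stand-in for an RDD that records work instead of doing it."""

    jobs_run = 0  # how many "jobs" have been triggered by actions so far

    def __init__(self, source, steps=()):
        self._source = source        # the input data (a plain list here)
        self._steps = tuple(steps)   # pending transformations, not yet applied

    def map(self, fn):
        # Transformation: returns a new ToyRDD and computes nothing.
        return ToyRDD(self._source, self._steps + (fn,))

    def collect(self):
        # Action: this is the point where a "job" is spawned.
        ToyRDD.jobs_run += 1
        result = list(self._source)
        for fn in self._steps:
            result = [fn(x) for x in result]
        return result


numbers = ToyRDD([1, 2, 3])
doubled = numbers.map(lambda x: x * 2)   # no job yet
shifted = doubled.map(lambda x: x + 1)   # still no job
jobs_before_action = ToyRDD.jobs_run     # nothing has run so far
collected = shifted.collect()            # the action triggers the first job
jobs_after_action = ToyRDD.jobs_run
```

In this sketch the two `map` calls only build up a description of the work, and the counter changes only when `collect` is called, which is the behavior the glossary attributes to Spark actions.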

So in this context, let's say you need to do the following:


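  1. Load a file with people's names and addresses into RDD1

  2. Load a file with people's names and phone numbers into RDD2

  3. Join RDD1 and RDD2 by name, to get RDD3

  4. Map on RDD3 to get a nice HTML presentation card for each person as RDD4

  5. Save RDD4 to a file.

  6. Map RDD1 to extract zip codes from the addresses, to get RDD5

  7. Aggregate on RDD5 to get a count of how many people live in each zip code as RDD6

  8. Collect RDD6 and print these stats to stdout.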
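The eight steps above could be sketched in PySpark roughly as follows. The input format (one `name,value` pair per line), the rule that the zip code is the last token of the address, and all function and parameter names (`parse_pair`, `extract_zip`, `html_card`, `run_pipeline`, the path arguments) are assumptions made for illustration; none of them come from the original answer. The `sc` argument is expected to be an existing `SparkContext`.

```python
# Sketch of the eight steps. File formats, paths, and helper names are
# assumptions made for illustration; they are not from the original answer.

def parse_pair(line):
    """Split a 'name,value' line into a (name, value) tuple."""
    name, value = line.split(",", 1)
    return name.strip(), value.strip()


def extract_zip(address):
    """Assume the zip code is the last whitespace-separated token."""
    return address.split()[-1]


def html_card(record):
    """Render one joined record (name, (address, phone)) as an HTML snippet."""
    name, (address, phone) = record
    return "<div><h1>{}</h1><p>{}</p><p>{}</p></div>".format(name, address, phone)


def run_pipeline(sc, addresses_path, phones_path, cards_path):
    """Driver logic; `sc` is an existing pyspark SparkContext."""
    rdd1 = sc.textFile(addresses_path).map(parse_pair)    # step 1
    rdd1.persist()                                        # RDD1 is reused in step 6
    rdd2 = sc.textFile(phones_path).map(parse_pair)       # step 2
    rdd3 = rdd1.join(rdd2)                                # step 3: (name, (address, phone))
    rdd4 = rdd3.map(html_card)                            # step 4
    rdd4.saveAsTextFile(cards_path)                       # step 5: an action
    rdd5 = rdd1.map(lambda kv: (extract_zip(kv[1]), 1))   # step 6
    rdd6 = rdd5.reduceByKey(lambda a, b: a + b)           # step 7
    for zip_code, count in rdd6.collect():                # step 8: an action
        print(zip_code, count)
```

With PySpark installed, a script would create a `SparkContext`, call `run_pipeline` with real paths, and be launched through spark-submit. Only the lines marked as actions make Spark compute anything; every other line just describes an RDD.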

So,


  1. The driver program is this entire piece of code, running all 8 steps.
  2. Producing the entire HTML card set in step 5 is a job (this is clear because we are using the save action, not a transformation). The same goes for the collect in step 8.
  3. Other steps will be organized into stages, with each job being the result of a sequence of stages. For simple things a job can have a single stage, but the need to repartition data (for instance, the join in step 3) or anything else that breaks the locality of the data usually causes more stages to appear. You can think of stages as computations that produce intermediate results, which can in fact be persisted. For instance, we can persist RDD1, since we'll be using it more than once, to avoid recomputation.
  4. All three points above are basically about how the logic of a given algorithm gets broken down. In contrast, a task is a particular piece of data that will go through a given stage, on a given executor.
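As a rough illustration of the last point, the following plain-Python toy treats a task as one partition of data going through one stage. It is a conceptual model, not Spark internals: `split_into_partitions`, `run_stage`, and the `"stage-0"` label are hypothetical names chosen for this sketch, and executors are left out entirely.

```python
# Toy model only, not Spark internals: one task per (stage, partition) pair.

def split_into_partitions(data, num_partitions):
    """Deal items round-robin into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for index, item in enumerate(data):
        partitions[index % num_partitions].append(item)
    return partitions


def run_stage(stage_name, fn, partitions):
    """Apply fn to every partition; each application stands for one task."""
    tasks = []
    outputs = []
    for partition_id, partition in enumerate(partitions):
        tasks.append((stage_name, partition_id))      # the task's identity
        outputs.append([fn(x) for x in partition])    # the task's work
    return tasks, outputs


partitions = split_into_partitions(range(10), 3)
tasks, outputs = run_stage("stage-0", lambda x: x * x, partitions)
```

Here the same stage logic (squaring) runs once per partition, so three partitions yield three tasks. In a real cluster each of those tasks would be scheduled on some executor.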

Hope it makes things clearer ;-)
