What is a task in Spark? How does the Spark worker execute the jar file?

Problem description

After reading some documentation at http://spark.apache.org/docs/0.8.0/cluster-overview.html, I have some questions that I want to clarify.

Take this example from Spark:

JavaSparkContext spark = new JavaSparkContext(
  new SparkConf().setJars(new String[]{"..."}).setSparkHome("..."));
JavaRDD<String> file = spark.textFile("hdfs://...");

// step1: split each line into words
JavaRDD<String> words =
  file.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
      return Arrays.asList(s.split(" "));
    }
  });

// step2: map each word to a (word, 1) pair
JavaPairRDD<String, Integer> pairs =
  words.map(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, 1);
    }
  });

// step3: sum the counts for each word
JavaPairRDD<String, Integer> counts =
  pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) {
      return a + b;
    }
  });

counts.saveAsTextFile("hdfs://...");

So let's say I have a 3-node cluster, with node 1 running as the master, and the above driver program has been properly jarred (say application-test.jar). So now I'm running this code on the master node, and I believe that right after the SparkContext is created, the application-test.jar file will be copied to the worker nodes (and each worker will create a directory for that application).
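
For concreteness, here is a rough sketch of how that driver might be constructed (the master URL is an assumption based on the standalone default port; application-test.jar is simply the jar named above):

// Hypothetical driver-side setup for this scenario; the master URL and the local
// jar path are assumptions, not taken from a real cluster.
JavaSparkContext spark = new JavaSparkContext(
  new SparkConf()
    .setMaster("spark://node1:7077")                  // node 1 acting as the master (assumed URL)
    .setAppName("application-test")
    .setJars(new String[]{"application-test.jar"}));  // shipped to each worker's application dir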

So now my question: are step1, step2 and step3 in the example the tasks that get sent over to the workers? If so, how does the worker execute them? Something like java -cp "application-test.jar" step1, and so on?

Recommended answer

When you create the SparkContext, each worker starts an executor. This is a separate process (JVM), and it loads your jar too. The executors connect back to your driver program. Now the driver can send them commands, like flatMap, map and reduceByKey in your example. When the driver quits, the executors shut down.
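
For example, a minimal sketch of how the driver from the question might finish (spark.stop() is not in the original snippet; it is the explicit way to end the application, and even without it the executors go away once the driver process exits):

// the action makes the executors compute and write out their partitions
counts.saveAsTextFile("hdfs://...");

// stopping the context (or simply exiting the driver) shuts the executors down
spark.stop();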

RDDs are sort of like big arrays that are split into partitions, and each executor can hold some of these partitions.
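
As a hypothetical illustration (not part of the original example; the partition counts are arbitrary):

// ask for at least 6 partitions when reading the file, so the work can be
// spread across the executors on the worker nodes
JavaRDD<String> lines = spark.textFile("hdfs://...", 6);

// a small in-memory collection can also be split into an explicit number of partitions
JavaRDD<Integer> numbers = spark.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 3);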

A task is a command sent from the driver to an executor by serializing your Function object. The executor deserializes the command (this is possible because it has loaded your jar), and executes it on a partition.
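
For instance (a hypothetical snippet, not from the original post), anything the Function object captures is serialized with it and shipped to the executors as part of each task:

// the anonymous Function below, including the captured minLength value, is
// serialized by the driver and deserialized by each executor, which then runs
// call() on the elements of its own partitions
final int minLength = 4;
JavaRDD<String> longWords = words.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.length() >= minLength;
  }
});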

(This is a conceptual overview. I am glossing over some details, but I hope it is helpful.)

To answer your specific question: No, a new process is not started for each step. A new process is started on each worker when the SparkContext is constructed.
