Apache Spark: Driver (instead of just the Executors) tries to connect to Cassandra

Problem Description

I guess I'm not yet fully understanding how Spark works.

Here is my setup:

I'm running a Spark cluster in Standalone mode. I'm using 4 machines for this: One is the Master, the other three are Workers.

I have written an application that reads data from a Cassandra cluster (see https://github.com/journeymonitor/analyze/blob/master/spark/src/main/scala/SparkApp.scala#L118).

The 3-node Cassandra cluster runs on the same machines that also host the Spark Worker nodes. The Spark Master node does not run a Cassandra node:

Machine 1      Machine 2        Machine 3        Machine 4
Spark Master   Spark Worker     Spark Worker     Spark Worker
               Cassandra node   Cassandra node   Cassandra node

The reasoning behind this is that I want to optimize data locality - when running my Spark app on the cluster, each Worker only needs to talk to its local Cassandra node.

Now, when submitting my Spark app to the cluster by running spark-submit --deploy-mode client --master spark://machine-1 from Machine 1 (the Spark Master), I expect the following:


  • a Driver instance is started on the Spark Master
  • the Driver starts one Executor on each Spark Worker
  • the Driver distributes my application to each Executor
  • my application runs on each Executor, and from there, talks to Cassandra via 127.0.0.1:9042 (see the sketch below)
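For context, the Cassandra-reading part of my app is set up roughly like this (a simplified sketch, not the exact code from SparkApp.scala; the keyspace and table names are placeholders). Note that I left the connection host at the loopback address, because I expected only the co-located Executors to open connections:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds cassandraTable() to SparkContext

object SparkApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("analyze")
      // My assumption: every Executor talks to its local Cassandra node,
      // so the loopback address should be enough.
      .set("spark.cassandra.connection.host", "127.0.0.1")

    val sc = new SparkContext(conf)

    // "my_keyspace" and "my_table" are placeholder names.
    val rows = sc.cassandraTable("my_keyspace", "my_table")
    println(rows.count())

    sc.stop()
  }
}
```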

However, this doesn't seem to be the case. Instead, the Spark Master tries to talk to Cassandra (and fails, because there is no Cassandra node on the Machine 1 host).

What is it that I misunderstand? Does it work differently? Does the Driver in fact read the data from Cassandra and distribute it to the Executors? But then I could never read data larger than the memory of Machine 1, even if the total memory of my cluster is sufficient.

Or, does the Driver talk to Cassandra not to read data, but to find out how to partition the data, and instruct the Executors to read "their" part of the data?

If someone could enlighten me, that would be much appreciated.

Recommended Answer

The Driver program is responsible for creating the SparkContext and SQLContext and for scheduling tasks on the worker nodes. This includes creating logical and physical plans and applying optimizations. To be able to do that, it has to have access to the data source schema and possibly other information such as statistics. Implementation details vary from source to source, but generally speaking it means that the data should be accessible on all nodes, including the application master.
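For illustration, here is a sketch (my own, assuming the DataStax Spark Cassandra Connector that the question's app uses; keyspace and table names are placeholders) of why the Driver needs Cassandra connectivity: even computing the RDD's partitions, a purely Driver-side step, already makes the connector contact Cassandra:

```scala
// Runs on the Driver; reuses the sc from the sketch above.
val rdd = sc.cassandraTable("my_keyspace", "my_table")

// Building the partitions happens on the Driver: the connector queries
// Cassandra for the token ranges and splits them into Spark partitions.
// This is the step that fails if the Driver host cannot reach Cassandra.
println(s"number of partitions: ${rdd.partitions.length}")

// Each partition carries the replica hosts that own its token ranges;
// Spark uses these preferred locations to schedule read tasks node-locally.
rdd.partitions.take(3).foreach { p =>
  println(s"partition ${p.index} prefers: ${rdd.preferredLocations(p).mkString(", ")}")
}
```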

At the end of the day, your expectations are almost correct. Chunks of the data are fetched individually on each worker without going through the driver program, but the driver has to be able to connect to Cassandra to fetch the required metadata.
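In practice that means the connector's contact point must be reachable from the Driver as well. A minimal sketch of the change (my assumption of the usual fix, not spelled out in the answer; "machine-2" stands for any host that runs a Cassandra node):

```scala
val conf = new SparkConf()
  .setAppName("analyze")
  // Use a host the Driver can actually reach instead of 127.0.0.1.
  // The connector only needs this as an initial contact point; it
  // discovers the remaining Cassandra nodes from there.
  .set("spark.cassandra.connection.host", "machine-2")
```

The reads themselves are still scheduled for locality: each Executor fetches the token ranges whose replicas live on its own machine, so the data-locality goal of the cluster layout is preserved.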
