Apache Spark: Driver (instead of just the Executors) tries to connect to Cassandra


Problem description

I guess I don't yet fully understand how Spark works.

Here is my setup:

I'm running a Spark cluster in Standalone mode. I'm using 4 machines for this: one is the Master, the other three are Workers.

I have written an application that reads data from a Cassandra cluster (see https://github.com/journeymonitor/analyze/blob/master/spark/src/main/scala/SparkApp.scala#L118).
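For context, a minimal sketch of what such a read looks like with the spark-cassandra-connector (the contact point, keyspace and table names below are placeholders; the real values are in the linked SparkApp.scala):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds cassandraTable() to SparkContext

// Placeholder contact point, keyspace and table names.
val conf = new SparkConf()
  .setAppName("analyze")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Builds an RDD backed by Cassandra token ranges; nothing is read yet.
val rows = sc.cassandraTable("mykeyspace", "mytable")
println(rows.count()) // triggers the distributed read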

The 3-node Cassandra cluster runs on the same machines that also host the Spark Worker nodes. The Spark Master node does not run a Cassandra node:

Machine 1      Machine 2        Machine 3        Machine 4
Spark Master   Spark Worker     Spark Worker     Spark Worker
               Cassandra node   Cassandra node   Cassandra node

The reasoning behind this is that I want to optimize data locality - when running my Spark app on the cluster, each Worker only needs to talk to its local Cassandra node.
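Whether that locality expectation actually holds can be inspected from the RDD itself, since the connector exposes the Cassandra replica nodes as preferred locations for each partition. A diagnostic sketch, reusing the sc and placeholder table from the snippet above:

// Each partition maps to a group of Cassandra token ranges, and its
// preferred locations are the replica nodes, which is what lets the
// scheduler place tasks next to the data.
val rdd = sc.cassandraTable("mykeyspace", "mytable")
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}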

Now, when submitting my Spark app to the cluster by running spark-submit --deploy-mode client --master spark://machine-1 from Machine 1 (the Spark Master), I expect the following:

  • a Driver instance is started on the Spark Master
  • the Driver starts one Executor on each Spark Worker
  • the Driver distributes my application to each Executor
  • my application runs on each Executor, and from there, talks to Cassandra via 127.0.0.1:9042

However, this doesn't seem to be the case. Instead, the Spark Master tries to talk to Cassandra (and fails, because there is no Cassandra node on the Machine 1 host).

What am I misunderstanding? Does it work differently? Does the Driver in fact read the data from Cassandra and then distribute it to the Executors? But then I could never read data larger than the memory of Machine 1, even if the total memory of my cluster is sufficient.

Or does the Driver talk to Cassandra not to read data, but to find out how to partition the data, and then instruct the Executors to read "their" part of it?

If someone could enlighten me, that would be much appreciated.

Solution

The driver program is responsible for creating the SparkContext and SQLContext and for scheduling tasks on the worker nodes. This includes creating logical and physical plans and applying optimizations. To be able to do that, it has to have access to the data source schema and possibly other information, such as statistics. Implementation details vary from source to source, but generally speaking this means that the data source should be accessible from all nodes, including the application master.
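To make that concrete: the connector opens sessions on the driver JVM too, not only inside tasks. A small illustration (it merely queries a Cassandra system table) that runs entirely on the driver and fails in exactly the way described here when no contact point is reachable from Machine 1:

import com.datastax.spark.connector.cql.CassandraConnector

// Runs on the driver JVM using the same spark.cassandra.connection.host
// setting; this is the kind of connection that fails on Machine 1.
CassandraConnector(sc.getConf).withSessionDo { session =>
  println(session.execute("SELECT cluster_name FROM system.local").one())
}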

At the end of the day, your expectations are almost correct: chunks of the data are fetched individually on each worker without going through the driver program, but the driver has to be able to connect to Cassandra to fetch the required metadata.
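Practically, that means spark.cassandra.connection.host must not point at 127.0.0.1 in this topology, because the driver on Machine 1 has no local Cassandra node. A minimal configuration sketch ("machine-2" is a placeholder for one of the worker hostnames; the executors will still prefer their co-located replicas for the actual reads):

// Point the connector at a Cassandra node the Driver can actually reach.
val conf = new SparkConf()
  .setMaster("spark://machine-1:7077") // 7077 is the default standalone master port
  .set("spark.cassandra.connection.host", "machine-2")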
