Apache Spark architecture
Question
I've been trying to find complete documentation on the internal architecture of Apache Spark, but haven't found anything.
For example, I'm trying to understand the following: assume we have a 1 TB text file on HDFS (3 nodes in the cluster, replication factor 1). The file will be split into 128 MB chunks, and each chunk will be stored on exactly one node. We run Spark workers on these nodes. I know that Spark tries to work with data stored in HDFS on the same node (to avoid network I/O). For example, I'm trying to do a word count on this 1 TB text file.
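To make the setup concrete, here is a minimal sketch (plain Python, not actual Spark code) of the map/reduce shape that a Spark word count takes: each chunk is counted independently by the worker that holds it locally, and only the partial counts are merged afterwards. The function name and the sample chunks are illustrative, not part of any Spark API.

```python
from collections import Counter
from functools import reduce

def word_count(chunks):
    """Simulate the shape of a Spark word count: each 'chunk'
    stands in for a 128 MB HDFS block processed by the worker
    that stores it locally."""
    # Map phase: each worker counts words in its own chunk only.
    partial = [Counter(chunk.split()) for chunk in chunks]
    # Reduce phase: the small partial counts (not the raw data)
    # are merged into a single result.
    return reduce(lambda a, b: a + b, partial, Counter())

chunks = ["spark counts words", "words on spark workers"]
print(word_count(chunks))
```

The point of the sketch is that only the per-chunk `Counter` objects cross chunk boundaries; the raw text never has to leave the node it is stored on.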
Here I have some further questions:
- Will Spark load each chunk (128 MB) into RAM, count the words, drop it from memory, and process chunks sequentially? What happens if no RAM is available?
- When does Spark not use data that is local on HDFS?
- If I need to perform a more complex task, where each worker's per-iteration results need to be transferred to all other workers (shuffling?), do I need to write them to HDFS myself and then read them back? For example, I can't understand how K-means clustering or gradient descent work on Spark.
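On the last question, the key idea is that iterative algorithms on Spark do not round-trip through HDFS between iterations: each iteration is a map (per-partition aggregates) plus a reduce (merge the aggregates), and only the tiny aggregates are shuffled while the data stays cached in the workers' memory. A hedged, pure-Python sketch of this pattern for 1-D K-means (the function and its arguments are illustrative, not a Spark API):

```python
def kmeans_1d(partitions, centroids, iters=10):
    """Sketch of the iteration structure K-means has on Spark:
    per-partition (sum, count) aggregates are computed (map),
    merged (reduce), and the driver updates the centroids.
    No intermediate results are written to HDFS."""
    for _ in range(iters):
        # Map: each partition assigns its points to the nearest
        # centroid and accumulates (sum, count) per centroid.
        sums = [0.0] * len(centroids)
        counts = [0] * len(centroids)
        for part in partitions:
            for x in part:
                j = min(range(len(centroids)),
                        key=lambda i: abs(x - centroids[i]))
                sums[j] += x
                counts[j] += 1
        # Reduce + driver update: only these small aggregates are
        # shuffled across the cluster, never the full data set.
        centroids = [s / c if c else m
                     for s, c, m in zip(sums, counts, centroids)]
    return centroids
```

For example, `kmeans_1d([[1.0, 1.2], [9.0, 9.4]], [0.0, 10.0])` converges to centroids near 1.1 and 9.2. In real Spark the same structure appears as a cached RDD/DataFrame plus an aggregation per iteration, with the driver holding only the centroids.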
I would appreciate any link to an Apache Spark architecture guide.
Answer
Adding to the other answers, I would like to include the Spark core architecture diagram, since it was mentioned in the question.
The master is the entry point here.