Apache Spark architecture
Question
I've been trying to find complete documentation on the internal architecture of Apache Spark, but haven't found anything.
For example, I'm trying to understand the following: assume we have a 1 TB text file on HDFS (3 nodes in the cluster, replication factor 1). The file will be split into 128 MB chunks, and each chunk will be stored on exactly one node. We run Spark workers on these nodes. I know that Spark tries to work with data stored in HDFS on the same node (to avoid network I/O). For example, I'm trying to do a word count on this 1 TB text file.
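To make the setup concrete, here is a minimal sketch (plain Python, not actual Spark code) of the map/reduce shape that a Spark word count takes: each chunk is counted independently by the worker that holds it locally, and only the partial counts are merged afterwards. The function name and the sample chunks are illustrative, not part of any Spark API.

```python
from collections import Counter
from functools import reduce

def word_count(chunks):
    """Simulate the shape of a Spark word count: each 'chunk'
    stands in for a 128 MB HDFS block processed by the worker
    that stores it locally."""
    # Map phase: each worker counts words in its own chunk only.
    partial = [Counter(chunk.split()) for chunk in chunks]
    # Reduce phase: the small partial counts (not the raw data)
    # are merged into a single result.
    return reduce(lambda a, b: a + b, partial, Counter())

chunks = ["spark counts words", "words on spark workers"]
print(word_count(chunks))
```

The point of the sketch is that only the per-chunk `Counter` objects cross chunk boundaries; the raw text never has to leave the node it is stored on.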
Here I have some further questions:
- Will Spark load each chunk (128 MB) into RAM, count the words, drop it from memory, and process chunks sequentially? What happens if no RAM is available?
- When does Spark not use data that is local on HDFS?
- If I need to perform a more complex task, where each worker's per-iteration results need to be transferred to all other workers (shuffling?), do I need to write them to HDFS myself and then read them back? For example, I can't understand how K-means clustering or gradient descent work on Spark.
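On the last question, the key idea is that iterative algorithms on Spark do not round-trip through HDFS between iterations: each iteration is a map (per-partition aggregates) plus a reduce (merge the aggregates), and only the tiny aggregates are shuffled while the data stays cached in the workers' memory. A hedged, pure-Python sketch of this pattern for 1-D K-means (the function and its arguments are illustrative, not a Spark API):

```python
def kmeans_1d(partitions, centroids, iters=10):
    """Sketch of the iteration structure K-means has on Spark:
    per-partition (sum, count) aggregates are computed (map),
    merged (reduce), and the driver updates the centroids.
    No intermediate results are written to HDFS."""
    for _ in range(iters):
        # Map: each partition assigns its points to the nearest
        # centroid and accumulates (sum, count) per centroid.
        sums = [0.0] * len(centroids)
        counts = [0] * len(centroids)
        for part in partitions:
            for x in part:
                j = min(range(len(centroids)),
                        key=lambda i: abs(x - centroids[i]))
                sums[j] += x
                counts[j] += 1
        # Reduce + driver update: only these small aggregates are
        # shuffled across the cluster, never the full data set.
        centroids = [s / c if c else m
                     for s, c, m in zip(sums, counts, centroids)]
    return centroids
```

For example, `kmeans_1d([[1.0, 1.2], [9.0, 9.4]], [0.0, 10.0])` converges to centroids near 1.1 and 9.2. In real Spark the same structure appears as a cached RDD/DataFrame plus an aggregation per iteration, with the driver holding only the centroids.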
I would appreciate any link to an Apache Spark architecture guide.
Answer
Adding to the other answers, I would like to include the Spark core architecture diagram, since it was mentioned in the question.
The master is the entry point here.