Standalone Spark cluster on Mesos accessing HDFS data in a different Hadoop cluster


Problem description

We have a Hadoop cluster with 275 datanodes (55 TB total memory, 12,000 VCores). This cluster is shared with a couple of projects, and we have a YARN queue assigned to us with limited resources.

To improve performance, we are thinking about building a separate Spark cluster for our project (on Mesos, in the same network) and accessing the HDFS data on the Hadoop cluster.

As mentioned in the Spark documentation: https://spark.apache.org/docs/latest/spark-standalone.html#running-alongside-hadoop
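For context, a minimal sketch of what such a job could look like from the application side: it runs against the Mesos master while the input path is a fully qualified hdfs:// URI pointing at the other cluster's NameNode. All hostnames, ports and paths below are hypothetical; alternatively, the remote cluster's core-site.xml and hdfs-site.xml can be placed in HADOOP_CONF_DIR on the driver and executors.

```scala
import org.apache.spark.sql.SparkSession

object RemoteHdfsRead {
  def main(args: Array[String]): Unit = {
    // Spark runs on the Mesos cluster; the data stays on the Hadoop cluster's HDFS.
    val spark = SparkSession.builder()
      .appName("remote-hdfs-read")
      .master("mesos://zk://mesos-zk1:2181,mesos-zk2:2181/mesos") // hypothetical Mesos master
      .getOrCreate()

    // Fully qualified URI pointing at the other cluster's NameNode, so reads
    // cross the network instead of hitting a co-located HDFS.
    val df = spark.read.parquet("hdfs://hadoop-nn.example.com:8020/data/events/")

    println(s"rows: ${df.count()}")
    spark.stop()
  }
}
```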

My questions are:

  1. Isn't this against the philosophy of Hadoop: "moving the computation to the data"?

  2. For optimal performance, how many nodes will we need for the new Spark cluster?

-- Edit --

I would like to know how the data loading happens. For example, if I execute a SparkSQL query on a table, does it create RDDs in the Mesos Spark cluster by loading data from the Hadoop cluster, and then do the processing on the generated RDDs? Doesn't this cross-cluster data IO impact the performance? Normally, in a YARN-Spark setup, the RDDs and the data are on the same nodes.

Recommended answer

Isn't this against the philosophy of Hadoop: "moving the computation to the data"?

In general, yes, especially if these nodes are located in different data centers; the closer, the better. Now, I read that they are in the same network:

on Mesos in the same network

Measure the latency between the machines. Only then can you judge whether it is good or not.
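One quick way to get a ballpark figure is to time TCP connects from a Mesos node to the remote NameNode's RPC port, as in the sketch below. The host and port are placeholders, and a plain ping or an actual read benchmark against HDFS would of course be more representative than connection latency alone.

```scala
import java.net.{InetSocketAddress, Socket}

object LatencyCheck {
  def main(args: Array[String]): Unit = {
    val host = "hadoop-nn.example.com" // placeholder: remote NameNode host
    val port = 8020                    // placeholder: NameNode RPC port
    val samples = 10

    // Time a handful of TCP connects to the remote NameNode.
    val timesMs = (1 to samples).map { _ =>
      val socket = new Socket()
      val start = System.nanoTime()
      socket.connect(new InetSocketAddress(host, port), 2000) // 2 s timeout
      val elapsedMs = (System.nanoTime() - start) / 1e6
      socket.close()
      elapsedMs
    }

    println(f"TCP connect latency over $samples samples: " +
      f"avg ${timesMs.sum / samples}%.2f ms, min ${timesMs.min}%.2f ms, max ${timesMs.max}%.2f ms")
  }
}
```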

For optimal performance, how many nodes will we need for the new Spark cluster?

Optimal for whom? It depends entirely on your use case.

For example, if I execute a SparkSQL query on a table, does it create RDDs in the Mesos Spark cluster by loading data from the Hadoop cluster, and then do the processing on the generated RDDs?

Yes, although it is not a fixed "read everything, then process it" procedure: it constantly reads, processes, and writes out "partial results", because, as you may guess, it cannot load 1 TB of data into memory.
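As a rough illustration (the path, table name and columns below are made up), a SparkSQL query over data stored on the remote HDFS is executed so that each task reads only its own input split over the network, processes it, and shuffles or spills partial aggregates; the whole table never has to fit in memory at once.

```scala
import org.apache.spark.sql.SparkSession

object CrossClusterQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cross-cluster-query").getOrCreate()

    // The table's files live on the remote cluster's HDFS (made-up path and schema).
    spark.read.parquet("hdfs://hadoop-nn.example.com:8020/warehouse/sales/")
      .createOrReplaceTempView("sales")

    // Each task pulls only its own input split across the network, aggregates it,
    // and shuffles/spills partial results; the full table is never materialized in memory.
    val result = spark.sql(
      """SELECT region, SUM(amount) AS total
        |FROM sales
        |WHERE year = 2020
        |GROUP BY region""".stripMargin)

    // The (much smaller) result is written back partition by partition.
    result.write.mode("overwrite")
      .parquet("hdfs://hadoop-nn.example.com:8020/tmp/sales_by_region/")

    spark.stop()
  }
}
```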

Doesn't this cross-cluster data IO impact the performance? Since normally in a YARN-Spark setup the RDDs and the data are on the same nodes.

Definitely! However, as I already mentioned, if you want a more precise estimate you should at least measure the latency between the nodes in this network; maybe some nodes are closer to the HDFS machines than others.

Without measurements (latency, performance tests, etc.) and a careful analysis of the network topology, it is pure speculation.
