Standalone Spark cluster on Mesos accessing HDFS data in a different Hadoop cluster


Question

We have a Hadoop cluster with 275 datanodes (55 TB total memory, 12,000 VCores). The cluster is shared across several projects, and we have been assigned a YARN queue with limited resources.

To improve performance, we are considering building a separate Spark cluster for our project (on Mesos, in the same network) that accesses the HDFS data on the Hadoop cluster.

As mentioned in the Spark documentation: https://spark.apache.org/docs/latest/spark-standalone.html#running-alongside-hadoop
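For context, here is a minimal sketch of how such a job might be pointed at the other cluster's HDFS. The hostnames, ports, and job script below are hypothetical placeholders, and the `spark-submit` command is only assembled here, not executed:

```python
# Hypothetical endpoints -- substitute your own Mesos master and namenode.
mesos_master = "mesos://mesos-master.example:5050"
namenode = "hdfs://namenode.hadoop-cluster.example:8020"

# spark-submit invocation, assembled as an argv list.
# The spark.hadoop.* prefix passes the setting through to Hadoop's
# configuration, so hdfs:// paths resolve against the remote
# Hadoop cluster's namenode instead of any local default.
submit_cmd = [
    "spark-submit",
    "--master", mesos_master,
    "--conf", f"spark.hadoop.fs.defaultFS={namenode}",
    "my_job.py",  # hypothetical job script
]
print(" ".join(submit_cmd))
```

With this layout, the Spark executors run on the Mesos cluster while every input split is fetched over the network from the Hadoop cluster's datanodes, which is exactly the trade-off the questions below are about.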

My questions are:

  1. Doesn't this go against Hadoop's philosophy of "moving the computation to the data"?

  2. For optimal performance, how many nodes will we need for the new Spark cluster?

- Edit -

  3. I would like to know how this data loading happens. For example, if I execute a Spark SQL query on a table, does it create RDDs in the Mesos Spark cluster by loading the data from the Hadoop cluster, and then process those generated RDDs? Doesn't this cross-cluster data IO hurt performance? Normally, in a YARN-Spark setup, the RDDs and the data are on the same nodes.

Answer

Isn't this against the philosophy of Hadoop: "moving the computation to the data"?

In general, yes, especially if the nodes are located in different data centers; the closer, the better. Now, I read that they are in the same network:

on Mesos in the same network

Measure the latency between the machines. Only then can you judge whether it is acceptable.
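The summary step of such a measurement can be sketched in a few lines. The sample values below are made up for illustration; in practice you would feed in round-trip times collected with `ping` or a socket-level probe between the Mesos nodes and the HDFS datanodes:

```python
import statistics

def latency_summary(samples_ms):
    """Summarize round-trip latency samples (in milliseconds).

    Reports min, median, an approximate 99th percentile, and max,
    so occasional slow hops between the clusters stand out.
    """
    s = sorted(samples_ms)
    p99_index = min(len(s) - 1, int(0.99 * len(s)))
    return {
        "min": s[0],
        "median": statistics.median(s),
        "p99": s[p99_index],
        "max": s[-1],
    }

# Hypothetical samples: mostly sub-millisecond, one slow outlier.
print(latency_summary([0.4, 0.5, 0.6, 0.5, 12.0]))
```

A tight median with a large p99 or max would suggest that some node pairs sit on a worse network path than others, which matters for the scheduling discussion below.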

For optimal performance, how many nodes will we need for the new Spark cluster?

Optimal for whom? That depends entirely on your use case.

For example, if I execute a Spark SQL query on a table, does it create RDDs in the Mesos Spark cluster by loading data from the Hadoop cluster, and then do the processing on the generated RDDs?

Yes, although it is not a fixed "read everything, then process it" sequence: Spark constantly reads, processes, and writes out partial results, because, as you may guess, it cannot load 1 TB of data into memory at once.
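This partition-at-a-time behaviour can be mimicked with a plain generator. The partition size and row counts below are arbitrary illustrations of the idea, not Spark internals:

```python
def partitions(total_rows, rows_per_partition):
    """Yield one 'partition' of rows at a time, mimicking how Spark
    reads HDFS splits: never the full dataset in memory at once."""
    for start in range(0, total_rows, rows_per_partition):
        yield range(start, min(start + rows_per_partition, total_rows))

# Combine per-partition partial results, the way a Spark job reads,
# processes, and merges without materializing everything.
running_total = 0
for part in partitions(1_000_000, 100_000):
    running_total += sum(part)  # partial result for this partition

print(running_total)
```

In the cross-cluster setup the same thing happens, except that every partition read is a network transfer from the Hadoop cluster rather than a local disk read.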

Doesn't this cross-cluster data IO impact performance? Normally, in a YARN-Spark setup, the RDDs and the data are on the same nodes.

Definitely! However, as I mentioned already, if you want a more precise estimate, you should at least measure the latency between the nodes in this network; some nodes may be closer to the HDFS machines than others.

Without measurements (latency, performance tests, etc.) and a careful analysis of the network topology, this is pure speculation.
