Does Spark use data locality?

Problem Description

I'm trying to understand Apache Spark's internals. I wonder whether Spark uses any mechanism to ensure data locality when reading from an InputFormat or writing to an OutputFormat (or other formats natively supported by Spark and not derived from MapReduce).

In the first case (reading), my understanding is that, when using an InputFormat, each split is associated with the host (or hosts?) that contain its data, so Spark tries to assign tasks to executors on those hosts in order to reduce network transfer as much as possible.
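
One way to see this from the API side is to ask an RDD for the preferred locations of its partitions. Below is a minimal Scala sketch, assuming a hypothetical HDFS path; the RDD returned by hadoopFile is a HadoopRDD, whose locality preferences come from InputSplit.getLocations():

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object LocalityPeek {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("locality-peek"))

    // Read through the Hadoop InputFormat API; each partition of the
    // resulting HadoopRDD wraps one InputSplit.
    val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](
      "hdfs:///data/events.log") // hypothetical path

    // Each partition reports the hosts that hold its split's blocks;
    // the scheduler treats these as locality preferences when placing tasks.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }

    sc.stop()
  }
}
```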

In the case of writing, how would such a mechanism work? I know that, technically, a file in HDFS can be saved locally on any node and replicated to two others (so the network is used for two out of three replicas), but if you consider writing to other systems, such as NoSQL databases (Cassandra, HBase, and others), those systems have their own ways of distributing data. Is there a way to tell Spark to partition an RDD so as to optimize data locality based on the distribution of data expected by the output sink (the target NoSQL database, accessed natively or through an OutputFormat)?

I'm referring to an environment in which the Spark nodes and the NoSQL nodes live on the same physical machines.

Recommended Answer

If you use Spark and Cassandra on the same physical machines, you should check out the spark-cassandra-connector. It ensures data locality for both reads and writes.

For example, if you load a Cassandra table into an RDD, the connector will always try to perform the operations on this RDD locally on each node. And when you save the RDD to Cassandra, the connector will try to save the results locally as well.
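
As a minimal sketch of both directions (the keyspace, table, and column names here are hypothetical, and the contact point assumes a Cassandra node runs alongside each Spark node):

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object CassandraLocality {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-locality")
      .set("spark.cassandra.connection.host", "127.0.0.1") // local Cassandra node

    val sc = new SparkContext(conf)

    // Read: each partition of the RDD corresponds to a Cassandra token
    // range, so tasks prefer the replica nodes that own that range.
    val users = sc.cassandraTable("store", "users") // hypothetical keyspace/table

    // Narrow transformations (filter, map) keep data on the node that read it.
    val active = users
      .filter(_.getBoolean("active"))
      .map(r => (r.getInt("id"), r.getString("name")))

    // Write: rows are written out from the executors; with co-located
    // Cassandra replicas, writes can stay on the local node.
    active.saveToCassandra("store", "active_users", SomeColumns("id", "name"))

    sc.stop()
  }
}
```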

This assumes that your data is already balanced across your Cassandra cluster. If your partition key is not chosen correctly, you will end up with an unbalanced cluster anyway.

Also be aware of shuffling jobs in Spark. For example, if you perform a reduceByKey on an RDD, you will end up streaming data across the network anyway. So always plan these jobs carefully.
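
A minimal, self-contained sketch of where the shuffle happens:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-example"))

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    // map is a narrow transformation: each output partition depends on a
    // single input partition, so no data leaves the node.
    val pairs = words.map(w => (w, 1))

    // reduceByKey is a wide transformation: all records with the same key
    // must meet in one partition, so Spark shuffles data across the network
    // (after a map-side partial aggregation that cuts the volume sent).
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}
```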
