Does Spark use data locality?

Problem Description

I'm trying to understand Apache Spark's internals. I wonder whether Spark uses any mechanism to ensure data locality when reading from an InputFormat or writing to an OutputFormat (or other formats natively supported by Spark and not derived from MapReduce).

In the first case (reading), my understanding is that, when using an InputFormat, each split is associated with the host (or hosts?) that contain its data, so Spark tries to assign tasks to executors on those hosts in order to reduce network transfer as much as possible.
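
One way to see this from the API side is to ask an RDD for the preferred locations of its partitions. Below is a minimal Scala sketch, assuming a hypothetical HDFS path; the RDD returned by hadoopFile is a HadoopRDD, whose locality preferences come from InputSplit.getLocations():

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object LocalityPeek {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("locality-peek"))

    // Read through the Hadoop InputFormat API; each partition of the
    // resulting HadoopRDD wraps one InputSplit.
    val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](
      "hdfs:///data/events.log") // hypothetical path

    // Each partition reports the hosts that hold its split's blocks;
    // the scheduler treats these as locality preferences when placing tasks.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }

    sc.stop()
  }
}
```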

In the case of writing, how would such a mechanism work? I know that, technically, a file in HDFS can be saved locally on any node and replicated to two others (so the network is used for two out of three replicas), but if you consider writing to other systems, such as NoSQL databases (Cassandra, HBase, and others), those systems have their own ways of distributing data. Is there a way to tell Spark to partition an RDD so as to optimize data locality based on the distribution of data expected by the output sink (the target NoSQL database, accessed natively or through an OutputFormat)?

I'm referring to an environment in which the Spark nodes and the NoSQL nodes live on the same physical machines.

Recommended Answer

If you use Spark and Cassandra on the same physical machines, you should check out the spark-cassandra-connector. It ensures data locality for both reads and writes.

For example, if you load a Cassandra table into an RDD, the connector will always try to perform the operations on this RDD locally on each node. And when you save the RDD to Cassandra, the connector will try to save the results locally as well.
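
As a minimal sketch of both directions (the keyspace, table, and column names here are hypothetical, and the contact point assumes a Cassandra node runs alongside each Spark node):

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object CassandraLocality {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-locality")
      .set("spark.cassandra.connection.host", "127.0.0.1") // local Cassandra node

    val sc = new SparkContext(conf)

    // Read: each partition of the RDD corresponds to a Cassandra token
    // range, so tasks prefer the replica nodes that own that range.
    val users = sc.cassandraTable("store", "users") // hypothetical keyspace/table

    // Narrow transformations (filter, map) keep data on the node that read it.
    val active = users
      .filter(_.getBoolean("active"))
      .map(r => (r.getInt("id"), r.getString("name")))

    // Write: rows are written out from the executors; with co-located
    // Cassandra replicas, writes can stay on the local node.
    active.saveToCassandra("store", "active_users", SomeColumns("id", "name"))

    sc.stop()
  }
}
```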

This assumes that your data is already balanced across your Cassandra cluster. If your partition key is not chosen correctly, you will end up with an unbalanced cluster anyway.

Also be aware of shuffling jobs in Spark. For example, if you perform a reduceByKey on an RDD, you will end up streaming data across the network anyway. So always plan these jobs carefully.
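
A minimal, self-contained sketch of where the shuffle happens:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-example"))

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    // map is a narrow transformation: each output partition depends on a
    // single input partition, so no data leaves the node.
    val pairs = words.map(w => (w, 1))

    // reduceByKey is a wide transformation: all records with the same key
    // must meet in one partition, so Spark shuffles data across the network
    // (after a map-side partial aggregation that cuts the volume sent).
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}
```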
