spark + hadoop data locality

This article looks at data locality in spark + hadoop. The question and answer below may be a useful reference if you are dealing with the same problem.

Problem Description

I got an RDD of file names, so an RDD[String]. I get that by parallelizing a list of file names (of files inside HDFS).

Now I map this RDD, and my code opens a Hadoop stream using FileSystem.open(path), then processes it.

When I run my task, I look at the Spark UI / Stages and I see Locality Level = PROCESS_LOCAL for all the tasks. I don't think Spark could possibly achieve data locality the way I run the task (on a cluster of 4 data nodes), so how is that possible?
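
For reference, a minimal sketch of the setup described above, using Spark's Scala API. The file names, the app name, and the byte-count "processing" step are placeholders, not the asker's actual code:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object OpenFilesByName {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("open-files-by-name"))

    // Hypothetical list of HDFS file names; the RDD contains only these strings.
    val fileNames = Seq("hdfs:///data/part-00000", "hdfs:///data/part-00001")
    val nameRdd = sc.parallelize(fileNames)   // RDD[String]

    val byteCounts = nameRdd.map { name =>
      // Runs inside the task: open a Hadoop stream for this file name.
      val path = new Path(name)
      val fs   = FileSystem.get(path.toUri, new Configuration())
      val in   = fs.open(path)
      try {
        // Placeholder "processing": count the bytes in the stream.
        Iterator.continually(in.read()).takeWhile(_ != -1).size
      } finally {
        in.close()
      }
    }

    byteCounts.collect().foreach(println)
    sc.stop()
  }
}
```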

Solution

When FileSystem.open(path) is executed inside a Spark task, the file content is loaded into a local variable in the same JVM process that prepares the RDD partition(s). That is why the data locality for that RDD is always PROCESS_LOCAL.

-- vanekjar had already noted this in a comment on the question
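
For contrast, a sketch of the difference (paths are placeholders; assumes an existing SparkContext sc, e.g. in spark-shell). An RDD built with parallelize only contains the name strings, which already live in the executor JVM, while an RDD created through Spark's Hadoop input API (e.g. sc.textFile) knows the HDFS block locations and can therefore be scheduled NODE_LOCAL or RACK_LOCAL:

```scala
// Pattern from the question: the RDD's data is just the file-name strings,
// already present in the executor JVM, so every task reports PROCESS_LOCAL.
val byName = sc.parallelize(Seq("hdfs:///data/part-00000"))
  .map { name =>
    // ... FileSystem.open(new Path(name)) and process the stream ...
    name.length
  }

// Letting Spark read the files itself: the underlying HadoopRDD exposes the
// HDFS block locations as preferred locations, so the scheduler can place
// tasks on the data nodes holding the blocks (NODE_LOCAL / RACK_LOCAL).
val byContent = sc.textFile("hdfs:///data/part-*").map(_.length)
```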

Additional information about data locality in Spark

There are several levels of locality based on the data's current location. In order from closest to farthest:

  • PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible
  • NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes
  • NO_PREF data is accessed equally quickly from anywhere and has no locality preference
  • RACK_LOCAL data is on the same rack of servers. Data is on a different server on the same rack, so it needs to be sent over the network, typically through a single switch
  • ANY data is elsewhere on the network and not in the same rack

Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels.
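
The fallback described in that last paragraph is time-based and configurable. A hedged sketch of the relevant settings (the 3s values match Spark's documented default for spark.locality.wait; they are illustrative, not tuning advice):

```scala
import org.apache.spark.SparkConf

// How long the scheduler waits for a free slot at a better locality level
// before falling back to the next one; per-level keys default to the global key.
val conf = new SparkConf()
  .setAppName("locality-wait-example")
  .set("spark.locality.wait", "3s")          // global default wait
  .set("spark.locality.wait.process", "3s")  // PROCESS_LOCAL -> NODE_LOCAL
  .set("spark.locality.wait.node", "3s")     // NODE_LOCAL -> RACK_LOCAL
  .set("spark.locality.wait.rack", "3s")     // RACK_LOCAL -> ANY
```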



That concludes this article on spark + hadoop data locality. We hope the answer above is helpful, and thank you for supporting IT屋!
