spark + hadoop data locality
Problem description
I got an RDD of filenames, so an RDD[String]. I get that by parallelizing a list of filenames (of files inside HDFS). Now I map this RDD and my code opens a Hadoop stream using FileSystem.open(path), then I process it.
When I run my task, I look at the Spark UI / Stages and I see "Locality Level" = "PROCESS_LOCAL" for all the tasks. I don't think Spark could possibly achieve data locality the way I run the task (on a cluster of 4 data nodes), so how is that possible?
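For reference, a minimal sketch of the setup described in the question (the file names and the "processing" step are made up, and a SparkContext `sc` is assumed to be in scope):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Hypothetical file names; the real job parallelizes its own list of HDFS paths.
val fileNames = Seq("hdfs:///data/part-0000", "hdfs:///data/part-0001")
val filesRdd = sc.parallelize(fileNames)          // RDD[String], no block locations attached

val results = filesRdd.map { name =>
  val path = new Path(name)
  // Open a Hadoop stream inside the executor's JVM, as described above.
  val fs = FileSystem.get(path.toUri, new Configuration())
  val in = fs.open(path)
  try Source.fromInputStream(in).mkString.length  // stand-in for "then I process it"
  finally in.close()
}
```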
Answer
When FileSystem.open(path) is executed inside a Spark task, the file content is loaded into a local variable in the same JVM process that is preparing the RDD partition(s). So the data locality for that RDD is always PROCESS_LOCAL -- as vanekjar has already commented on the question.
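One way to see this, as a hedged sketch with hypothetical paths (again assuming a SparkContext `sc`): a parallelized collection carries no HDFS block location hints, so the scheduler has nothing to co-locate tasks with.

```scala
val paths = sc.parallelize(Seq("hdfs:///data/part-0000", "hdfs:///data/part-0001"))

// Preferred locations are empty for a parallelized collection, so Spark has no
// locality information to schedule against; the file bytes only appear once the
// task's own JVM opens the stream, hence PROCESS_LOCAL in the UI.
paths.partitions.foreach { p =>
  println(s"partition ${p.index}: ${paths.preferredLocations(p)}")  // empty Seq for every partition
}
```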
Additional information about data locality in Spark: There are several levels of locality based on the data's current location. In order from closest to farthest: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY. Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels.
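Not part of the original answer, but as an illustration of that fallback behaviour: how long Spark waits at a given locality level before dropping to a lower one is controlled by the spark.locality.wait settings. A sketch with illustrative values (3s happens to be the default):

```scala
import org.apache.spark.SparkConf

// Illustrative values only: how long the scheduler waits at each level before
// falling back to the next, farther one.
val conf = new SparkConf()
  .set("spark.locality.wait", "3s")           // base wait used by the per-level settings
  .set("spark.locality.wait.process", "3s")   // wait for a PROCESS_LOCAL slot
  .set("spark.locality.wait.node", "3s")      // wait for a NODE_LOCAL slot
  .set("spark.locality.wait.rack", "3s")      // wait for a RACK_LOCAL slot
```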