How does Spark partition(ing) work on files in HDFS?


Question

I'm working with Apache Spark on a cluster using HDFS. As far as I understand, HDFS distributes files across the data nodes. So if I put a "file.txt" on the filesystem, it will be split into partitions. Now I'm calling

rdd = SparkContext().textFile("hdfs://.../file.txt") 

from Apache Spark. Does rdd now automatically have the same partitions as "file.txt" on the filesystem? What happens when I call

rdd.repartition(x)

where x > the number of partitions used by HDFS? Will Spark physically rearrange the data on HDFS to work locally?

Example: I put a 30GB text file on the HDFS system, which distributes it across 10 nodes. Will Spark a) use the same 10 partitions? and b) shuffle 30GB across the cluster when I call repartition(1000)?

Solution

When Spark reads a file from HDFS, it creates a single partition for each input split. The input splits are determined by the Hadoop InputFormat used to read the file. For instance, if you use textFile(), Spark uses Hadoop's TextInputFormat, which returns one partition per HDFS block (though the boundary between partitions is aligned to line breaks rather than the exact block boundary), unless you have a compressed text file. For a compressed file you get a single partition for the whole file, because compressed text files are not splittable.
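A minimal PySpark sketch of this behaviour (the HDFS paths and app name are hypothetical); getNumPartitions() shows how many partitions the read produced:

  from pyspark import SparkContext

  sc = SparkContext(appName="partition-demo")

  # Uncompressed text file: one partition per HDFS block,
  # with splits aligned to line boundaries near the block edges.
  rdd = sc.textFile("hdfs:///data/file.txt")
  print(rdd.getNumPartitions())  # roughly file size / HDFS block size

  # gzip-compressed text file: not splittable, so one partition
  # for the whole file regardless of its size.
  gz = sc.textFile("hdfs:///data/file.txt.gz")
  print(gz.getNumPartitions())  # 1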

When you call rdd.repartition(x), it performs a shuffle of the data from the N partitions you have in rdd to the x partitions you want to have; the partitioning is done on a round-robin basis.
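Continuing the hypothetical sketch above, the new partition count is whatever you ask for, independent of the HDFS block layout:

  # Shuffle the existing N partitions into 1000 new ones,
  # distributing the data round-robin across them.
  bigger = rdd.repartition(1000)
  print(bigger.getNumPartitions())  # 1000

As a side note, if you only want fewer partitions, coalesce() merges existing partitions and avoids a full shuffle.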

If you have a 30GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (128MB) it would be stored in 235 blocks (30,000 MB / 128 MB ≈ 234.4, rounded up), which means that the RDD you read from this file would have 235 partitions. When you call repartition(1000), your RDD is merely marked for repartitioning; it is actually shuffled into 1000 partitions only when you execute an action on top of this RDD (lazy execution).
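A short sketch of the lazy-execution point (hypothetical path again): repartition() only records the intent, and the shuffle runs when an action forces it:

  rdd = sc.textFile("hdfs:///data/30gb-file.txt")
  print(rdd.getNumPartitions())   # ~235 with the default 128MB block size

  reparted = rdd.repartition(1000)
  print(reparted.getNumPartitions())  # 1000, but this is only metadata so far

  total = reparted.count()  # action: the 30GB shuffle actually happens here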
