How does Spark partition(ing) work on files in HDFS?

Question

I'm working with Apache Spark on a cluster using HDFS. As far as I understand, HDFS distributes files across the data nodes. So if I put a "file.txt" on the filesystem, it will be split into partitions. Now I'm calling

rdd = SparkContext().textFile("hdfs://.../file.txt") 

from Apache Spark. Does rdd now automatically have the same partitions as "file.txt" on the filesystem? What happens when I call

rdd.repartition(x)

where x > the number of partitions used by HDFS? Will Spark physically rearrange the data on HDFS to work locally?

Example: I put a 30GB text file on the HDFS system, which distributes it across 10 nodes. Will Spark a) use the same 10 partitions? and b) shuffle 30GB across the cluster when I call repartition(1000)?

Answer

When Spark reads a file from HDFS, it creates a single partition for a single input split. The input split is set by the Hadoop InputFormat used to read the file. For instance, if you use textFile(), it would be TextInputFormat in Hadoop, which would return you a single partition for a single HDFS block (though the split between partitions would be aligned to line breaks, not the exact block boundary), unless you have a compressed text file. In the case of a compressed file you would get a single partition for the whole file (as compressed text files are not splittable).
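To see this in practice, here is a minimal PySpark sketch; the paths, file sizes, and partition counts are hypothetical, and minPartitions is only a hint for more splits, never fewer:

from pyspark import SparkContext

sc = SparkContext(appName="partition-demo")

# Uncompressed text file: one partition per HDFS block,
# with partition boundaries adjusted to line breaks.
rdd = sc.textFile("hdfs:///data/file.txt")
print(rdd.getNumPartitions())   # roughly file_size / block_size

# You can request more input splits (but not fewer than the block count):
rdd_more = sc.textFile("hdfs:///data/file.txt", minPartitions=500)

# A gzipped file is not splittable, so the whole file becomes one partition:
rdd_gz = sc.textFile("hdfs:///data/file.txt.gz")
print(rdd_gz.getNumPartitions())  # 1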

When you call rdd.repartition(x), it performs a shuffle of the data from the N partitions you have in rdd to the x partitions you want to have; the partitioning is done on a round-robin basis.
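A small sketch of that behavior, assuming the SparkContext from above (the numbers are illustrative):

# repartition() always shuffles; because records are dealt out
# round-robin, the resulting partitions are roughly equal in size.
rdd4 = sc.parallelize(range(100), 4)    # 4 partitions of 25 elements
rdd10 = rdd4.repartition(10)            # shuffle into 10 partitions
print(rdd10.glom().map(len).collect())  # ~10 elements in each partition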

If you have a 30GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (128MB) it would be stored in 235 blocks, which means the RDD you read from this file would have 235 partitions. When you call repartition(1000), your RDD would be marked as "to be repartitioned", but in fact it would be shuffled to 1000 partitions only when you execute an action on top of this RDD (lazy execution concept).
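A hedged sketch of that lazy behavior (the path and partition counts are hypothetical):

rdd_big = sc.textFile("hdfs:///data/file30gb.txt")  # ~235 partitions for a 30GB file
repartitioned = rdd_big.repartition(1000)           # only records the shuffle in the plan
print(repartitioned.count())                        # the action triggers the 1000-way shuffle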
