About Hadoop/HDFS file splitting


Question

I just want to confirm the following. Please verify whether this is correct:

  1. As I understand it, when we copy a file into HDFS, that is the point at which the file (assuming its size > 64 MB = HDFS block size) is split into multiple chunks, and each chunk is stored on a different data node.

  2. The file contents are already split into chunks when the file is copied into HDFS, and no further file splitting happens at the time a map job runs. Map tasks are simply scheduled so that each one works on a chunk of at most 64 MB, with data locality (i.e. a map task runs on the node that contains its data/chunk).

  3. File splitting also happens if the file is compressed (gzipped), but MR ensures that each such file is processed by just one mapper, i.e. MR will collect all the chunks of the gzip file lying on other data nodes and give them all to a single mapper.

  4. The same thing as above will happen if we define isSplitable() to return false, i.e. all the chunks of a file will be processed by one mapper running on one machine. MR will read all the chunks of the file from the different data nodes and make them available to that single mapper, as the sketch below illustrates.
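
As a concrete illustration of point 4, here is a minimal sketch of an input format that disables splitting. It uses the standard org.apache.hadoop.mapreduce API; the class name WholeFileTextInputFormat is made up for this example:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Made-up example class: behaves exactly like TextInputFormat, except that
// isSplitable() always returns false, so the framework builds exactly one
// input split per file and hands the whole file to a single mapper.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // The file may still span many HDFS blocks on many data nodes;
        // we only tell MapReduce not to cut it into multiple splits.
        return false;
    }
}
```

Plugging such a format into a job (job.setInputFormatClass(WholeFileTextInputFormat.class)) makes the framework build exactly one input split per file, regardless of how many HDFS blocks the file spans.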

Answer


Your understanding is not quite right. I would point out that there are two, almost independent, processes: splitting files into HDFS blocks, and splitting files into input splits for processing by different mappers.
HDFS splits files into blocks based on the configured block size.
Each input format has its own logic for how files are split into parts for independent processing by different mappers. The default logic of FileInputFormat is to split files along HDFS block boundaries, but you can implement any other logic.
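
As a rough sketch of how this plays out in practice, the split size used by FileInputFormat can be tuned per job, independently of the HDFS block size (Hadoop 2.x mapreduce API; the input path is made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");

        // Logical input splits are a property of the input format, not of
        // HDFS: capping splits at 32 MB here yields two map tasks per
        // 64 MB block, while the file's physical block layout is unchanged.
        FileInputFormat.setMaxInputSplitSize(job, 32 * 1024 * 1024L);
        FileInputFormat.setMinInputSplitSize(job, 1L);

        FileInputFormat.addInputPath(job, new Path("/data/input")); // made-up path
        // ... set mapper, reducer, and output path as usual ...
    }
}
```

This shows that an input split is a scheduling concept of the input format, not a storage concept of HDFS.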
Compression is usually a foe of splitting, so a block-compression technique is employed to enable splitting of compressed data: each logical part (block) of the file is compressed independently.
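
One common concrete case of that block-compression technique is a SequenceFile written with BLOCK compression, where each batch of records is compressed independently so the file remains splittable. A minimal sketch (the job name and output path are made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class BlockCompressedOutputDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "block-compressed-output");
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // BLOCK compression compresses batches of records independently,
        // so a later job can still split the file even though gzip on its
        // own is not a splittable codec.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job,
                SequenceFile.CompressionType.BLOCK);

        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // made-up path
        // ... set input path, mapper, and key/value classes as usual ...
    }
}
```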

