Hadoop's input splitting - How does it work


Problem description


I have a brief understanding of Hadoop.

I am curious to know how it works.

To be precise, I want to know how exactly it divides/splits the input file.

Does it divide it into equal chunks in terms of size?

Or is it a configurable thing?

I did go through this post, but I couldn't understand it.

Solution

This is dependent on the InputFormat, which for most file-based formats is defined in the FileInputFormat base class.

There are a number of configurable options which determine how Hadoop will take a single file and either process it as a single split or divide the file into multiple splits:

  • If the input file is compressed, the input format and compression method must be splittable. Gzip, for example, is not splittable (you can't randomly seek to a point in the file and recover the compressed stream), while BZip2 is splittable. See the specific InputFormat.isSplittable() implementation for your input format for more information.
  • If the file size is less than or equal to its defined HDFS block size, then Hadoop will most probably process it in a single split (this can be configured; see a later point about the split size properties).
  • If the file size is greater than its defined HDFS block size, then Hadoop will most probably divide the file into splits based upon the underlying blocks (4 blocks would result in 4 splits).
  • You can configure two properties, mapred.min.split.size and mapred.max.split.size, which guide the input format when breaking blocks up into splits. Note that the minimum size may be overridden by the input format (which may have a fixed minimum input size).

If you want to know more, and are comfortable looking through the source, check out the getSplits() method in FileInputFormat (both the new and old APIs have the same method, but they may have some subtle differences).
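As a rough, simplified sketch of the per-file loop inside getSplits(): the real method also handles block locations, compression checks, and so on, but the core chopping logic looks like the following. The 1.1x slop factor mirrors the SPLIT_SLOP constant in FileInputFormat; the class and method names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the loop in FileInputFormat.getSplits(): chop a file
// of the given length into splits of splitSize bytes, letting the final
// split absorb a small remainder rather than emitting a tiny trailing split.
public class GetSplitsSketch {

    static final double SPLIT_SLOP = 1.1; // last split may be up to 10% bigger

    static List<Long> splitLengths(long fileLength, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long bytesRemaining = fileLength;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            lengths.add(bytesRemaining); // final split carries the remainder
        }
        return lengths;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // A 300 MB file with 128 MB splits -> three splits: 128 MB, 128 MB, 44 MB
        System.out.println(splitLengths(300 * mb, 128 * mb));
    }
}
```

Note how this matches the bullet points above: a file smaller than the split size produces a single split, and a larger file produces roughly one split per block.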
