Hadoop's input splitting - How does it work?


Problem description



I know a little about Hadoop and am curious to know how it works.

To be precise, I want to know how exactly it divides/splits the input file.

Does it divide it into equal-sized chunks, or is that configurable?

Any comments would be helpful.

I did go through this post, but I couldn't understand it.

Solution

This depends on the InputFormat, which for most file-based formats is defined in the FileInputFormat base class.
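As an illustration, here is a minimal sketch of how an input format can influence splitting by overriding FileInputFormat's protected isSplitable() hook (the class name here is hypothetical):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Forces every input file to be processed as one split, no matter its size.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            // Returning false makes FileInputFormat.getSplits() emit a single
            // split covering the whole file.
            return false;
        }
    }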

There are a number of configurable options that control how Hadoop takes a single file and either processes it as a single split or divides it into multiple splits (a configuration sketch follows this list):

  • If the input file is compressed, the input format and compression method must be splittable. Gzip, for example, is not splittable (you can't randomly seek to a point in the file and recover the compressed stream). BZip2 is splittable. See the specific isSplitable() implementation for your input format for more information.
  • If the file size is less than or equal to its defined HDFS block size, then Hadoop will most probably process it in a single split (this can be configured; see the later point about split size properties).
  • If the file size is greater than its defined HDFS block size, then Hadoop will most probably divide up the file into splits based upon the underlying blocks (4 blocks would result in 4 splits).
  • You can configure two properties, mapred.min.split.size and mapred.max.split.size, which help the input format when breaking up blocks into splits. Note that the minimum size may be overridden by the input format (which may have a fixed minimum input size).
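To make the last point concrete, here is a sketch of setting those properties before submitting a job; the values are arbitrary, and note that newer Hadoop releases renamed the properties to mapreduce.input.fileinputformat.split.minsize and .maxsize:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SplitSizeConfigDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Lower and upper bounds used by the input format when sizing splits.
            conf.setLong("mapred.min.split.size", 64L * 1024 * 1024);   // 64 MB
            conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);  // 256 MB
            Job job = Job.getInstance(conf, "split-size-demo");
            // The new API also offers static helpers on FileInputFormat:
            // FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
            // FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        }
    }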

If you want to know more, and are comfortable looking through the source, check out the getSplits() method in FileInputFormat (both the new and old APIs have the same method, but they may have some subtle differences).
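At its core, that method clamps the block size between the configured minimum and maximum; the sketch below paraphrases the arithmetic (the helper name matches FileInputFormat's real computeSplitSize(), the surrounding code is simplified):

    public class SplitSizeDemo {
        // Mirrors FileInputFormat.computeSplitSize(): clamp the block size
        // between the configured minimum and maximum split sizes.
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024; // 128 MB HDFS block
            long minSize = 1L;                   // default minimum
            long maxSize = Long.MAX_VALUE;       // default maximum
            long fileLength = 4 * blockSize;     // a 512 MB, 4-block file

            long splitSize = computeSplitSize(blockSize, minSize, maxSize);
            // With the defaults, splitSize == blockSize, so a 4-block file
            // yields 4 splits, matching the point made above.
            System.out.println(fileLength / splitSize + " splits of "
                    + splitSize + " bytes");
        }
    }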

