Hadoop input split for a compressed block


Problem description


If I have a 1GB compressed file that is splittable, and by default the block size and input split size is 128MB, then 8 blocks and 8 input splits are created. When a compressed block is read by MapReduce it is uncompressed, and say after decompression the size of the block becomes 200MB. But the input split assigned to it is 128MB, so how is the remaining 72MB processed? (The arithmetic behind these figures is sketched after the list below.)

  1. Is it processed by the next input split?
  2. Is the size of the same input split increased?
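
For reference, here is the split arithmetic the question assumes, as a small self-contained sketch (the 1 GB file size and 128 MB split size are the question's own figures):

```java
public class SplitArithmetic {
    public static void main(String[] args) {
        long fileSize  = 1024L << 20; // 1 GB compressed file, per the question
        long splitSize = 128L  << 20; // default block / input split size

        // Ceiling division: ceil(1 GB / 128 MB) = 8 blocks and 8 input splits.
        long numSplits = (fileSize + splitSize - 1) / splitSize;
        System.out.println(numSplits); // prints 8
    }
}
```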

Solution

Here is my understanding:

Let's assume 1 GB of compressed data = 2 GB of decompressed data. You then have 16 blocks of data, and Bzip2 knows the block boundaries because a bzip2 file provides a synchronization marker between blocks. So bzip2 splits the data into 16 pieces and sends the data to 16 mappers. Each mapper gets one input split's worth of decompressed data, i.e. 128 MB. (Of course, if the data is not an exact multiple of 128 MB, the last mapper will get less data.)
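
Whether Hadoop can split a compressed file at all depends on its codec: only codecs implementing SplittableCompressionCodec (bzip2 among them) support the sync-marker mechanism described above, while gzip, for example, forces the whole file into a single split. Below is a minimal sketch of that check, assuming Hadoop's Java API; the class name SplittableCheck and the input path are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittableCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]); // e.g. hdfs:///data/input.bz2 (hypothetical path)

        // Resolve the codec from the file extension (.bz2 -> BZip2Codec, .gz -> GzipCodec).
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(input);

        if (codec == null) {
            System.out.println("Uncompressed: can be split at any offset");
        } else if (codec instanceof SplittableCompressionCodec) {
            // bzip2 lands here: each split's reader seeks forward to the next
            // synchronization marker before it starts decompressing records.
            System.out.println(codec.getClass().getSimpleName() + ": splittable");
        } else {
            // gzip lands here: the whole file becomes one split / one mapper.
            System.out.println(codec.getClass().getSimpleName() + ": NOT splittable");
        }
    }
}
```

This is essentially the test that TextInputFormat's isSplitable() performs before the framework cuts a compressed file into more than one input split.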

