Hadoop gzip input file using only one mapper


Problem description


Possible Duplicate:
Why can't hadoop split up a large text file and then compress the splits using gzip?

I found that when the input file is gzipped, Hadoop allocates only one map task to handle my map/reduce job.

The gzipped file is more than 1.4 GB, so I would expect many mappers to run in parallel (exactly as happens with the un-zipped file).

Is there any configuration I can change to improve this?

Solution

Gzip files can't be split, so all the data is processed by a single map. A compression format that produces splittable files has to be used instead; the data will then be processed by multiple maps. Here is a nice article on it. (1)

Edit: Here is another article on Snappy (2), which is from Google.

(1) http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

(2) http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/
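To illustrate the splittability check behind this answer, here is a minimal sketch assuming Hadoop's mapreduce API, where TextInputFormat treats a compressed file as splittable only if its codec implements SplittableCompressionCodec. The class name SplitCheck is made up for this example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    // Hypothetical helper: reports whether a compressed input file can be split
    // into multiple map tasks, mirroring the check TextInputFormat performs.
    public class SplitCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);   // e.g. an input file such as big.gz

            // The codec is chosen from the file name suffix (.gz, .bz2, ...).
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(input);

            if (codec == null) {
                // Plain, uncompressed text: splittable, many mappers possible.
                System.out.println(input + " is not compressed -> splittable");
            } else if (codec instanceof SplittableCompressionCodec) {
                // e.g. BZip2Codec: still splittable across several mappers.
                System.out.println(input + " uses " + codec.getClass().getSimpleName()
                        + " -> splittable");
            } else {
                // GzipCodec (and plain .snappy files) fall here: the whole file
                // becomes one split, so the job gets exactly one map task.
                System.out.println(input + " uses " + codec.getClass().getSimpleName()
                        + " -> NOT splittable, expect a single mapper");
            }
        }
    }

GzipCodec does not implement SplittableCompressionCodec, so a .gz input ends up as one split and one mapper; bzip2-compressed input, or indexed LZO as in the linked article, passes the check and is split across many mappers.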



