Hadoop gzip compressed files
Question
I am new to Hadoop and am trying to process a Wikipedia dump. It's a 6.7 GB gzip-compressed XML file. I read that Hadoop supports gzip-compressed files, but such a file can only be handled by a single mapper, since only one mapper can decompress it. This seems to limit the processing. Is there an alternative? For example, decompressing the XML file, splitting it into multiple chunks, and recompressing each chunk with gzip.
I read about Hadoop and gzip at http://researchcomputing.blogspot.com/2008/04/hadoop-and-compressed-files.html
Thanks for your help.
Answer
A file compressed with the GZIP codec cannot be split because of the way the codec works. A single split in Hadoop can only be processed by a single mapper, so a single GZIP file can only be processed by a single mapper.
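The non-splittability is easy to see even outside Hadoop: a DEFLATE stream carries decoder state from byte 0, so a reader cannot jump into the middle of the file and start decompressing. A minimal shell sketch (the filename is made up for illustration):

```shell
# Create a small gzip file as a stand-in for a large dump.
seq 1 1000 | gzip > whole.gz

# Decompressing from the start of the file works fine:
gunzip -c whole.gz | tail -n 1

# Starting from an arbitrary mid-file offset does not -- the DEFLATE
# stream's header and dictionary state live at the beginning:
if tail -c +100 whole.gz | gunzip -c > /dev/null 2>&1; then
  echo "decompressed"
else
  echo "cannot start mid-stream"
fi
```

This is exactly why Hadoop cannot hand byte ranges of one gzip file to several mappers.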
There are at least three ways around that limitation:
- As a preprocessing step: decompress the file and recompress it with a splittable codec (LZO).
- As a preprocessing step: decompress the file, split it into smaller sets, and recompress them. (See this.)
- Use this patch for Hadoop (which I wrote), which allows for this: Splittable Gzip.
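The second option can be sketched in the shell. The line-based split below is only illustrative: the filenames are made up, a tiny generated file stands in for the 6.7 GB dump, and a real Wikipedia dump would need to be split on record boundaries (e.g. between `<page>` elements) rather than on arbitrary lines:

```shell
# Stand-in for the big dump (100 short lines, gzipped).
printf 'line %s\n' $(seq 1 100) | gzip > dump.xml.gz

# Decompress, split into fixed-size pieces, and recompress each piece.
gunzip -c dump.xml.gz | split -l 25 - chunk_   # 25 lines per chunk here
for f in chunk_*; do gzip "$f"; done

# Each chunk_*.gz is now an independent gzip file, so Hadoop can
# assign one mapper per file instead of one mapper for the whole dump.
ls chunk_*.gz
```

Nothing is lost in the round trip: concatenating the decompressed chunks reproduces the original stream.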
HTH