Hadoop gzip compressed files
Question
I am new to Hadoop and trying to process the Wikipedia dump. It is a 6.7 GB gzip-compressed XML file. I read that Hadoop supports gzip-compressed files, but that such a file can only be processed by a single mapper, because only one mapper can decompress it. This seems to put a limitation on the processing. Is there an alternative, like decompressing the XML file, splitting it into multiple chunks, and recompressing them with gzip?
I got this from http://researchcomputing.blogspot.com/2008/04/hadoop-and-compressed-files.html
Thanks for your help.
Answer
A file compressed with the GZIP codec cannot be split because of the way this codec works. A single split in Hadoop can only be processed by a single mapper, so a single GZIP file can only be processed by one mapper.
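To see why, this is roughly the check that Hadoop's file-based input formats perform when deciding whether a compressed input can be split. This is only a minimal sketch: the input path is a placeholder, and the SplittableCompressionCodec interface exists only in newer Hadoop versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Placeholder path to the input file
        Path input = new Path("/user/me/enwiki-dump.xml.gz");

        // Look up the codec by file extension, as the input format does
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(input);

        if (codec == null) {
            System.out.println("Uncompressed: the file can be split normally.");
        } else if (codec instanceof SplittableCompressionCodec) {
            System.out.println("Splittable codec: multiple splits and mappers are possible.");
        } else {
            // GzipCodec lands in this branch, so the whole file becomes one split
            System.out.println("Non-splittable codec (e.g. gzip): one split, one mapper.");
        }
    }
}
```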
There are at least three ways of working around that limitation:
- As a preprocessing step: uncompress the file and recompress it using a splittable codec (LZO)
- As a preprocessing step: uncompress the file, split it into smaller sets and recompress them. (See this; a rough sketch of this option follows the list.)
- Use this patch for Hadoop (which I wrote) that allows a way around this: Splittable Gzip
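For the second option, here is a rough, self-contained sketch (plain java.util.zip, no Hadoop dependencies) that reads the large gzip file once and writes it back out as several smaller gzip files. The file names and the lines-per-chunk value are arbitrary placeholders.

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of the split-and-recompress preprocessing step:
// stream the big gzip file and emit many smaller gzip chunks.
public class SplitAndRecompress {
    public static void main(String[] args) throws IOException {
        final long LINES_PER_CHUNK = 5_000_000L; // placeholder chunk size

        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream("enwiki-dump.xml.gz")), "UTF-8"))) {

            String line;
            long lineCount = 0;
            int chunk = 0;
            BufferedWriter out = openChunk(chunk);

            while ((line = in.readLine()) != null) {
                out.write(line);
                out.newLine();
                if (++lineCount % LINES_PER_CHUNK == 0) {
                    out.close();              // finish the current chunk
                    out = openChunk(++chunk); // start the next one
                }
            }
            out.close();
        }
    }

    private static BufferedWriter openChunk(int chunk) throws IOException {
        String name = String.format("enwiki-chunk-%04d.xml.gz", chunk);
        return new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(name)), "UTF-8"));
    }
}
```

Note that cutting on an arbitrary line count will break XML records; for a Wikipedia dump you would want to cut only after a record boundary (for example, a closing page tag) so that every chunk remains processable on its own.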
HTH