Hadoop gzip compressed files


Question

I am new to Hadoop and trying to process the Wikipedia dump. It's a 6.7 GB gzip compressed XML file. I read that Hadoop supports gzip compressed files, but such a file can only be processed by a single mapper in a job, since only one mapper can decompress it. This seems to put a limitation on the processing. Is there an alternative, like decompressing the XML file, splitting it into multiple chunks, and recompressing them with gzip?

I got this from http://researchcomputing.blogspot.com/2008/04/hadoop-and-compressed-files.html

Thanks for your help.

Answer

A file compressed with the GZIP codec cannot be split because of the way this codec works. A single split in Hadoop can only be processed by a single mapper, so a single GZIP file can only be processed by a single mapper.
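As an illustration (not part of the original answer), the sketch below, assuming Hadoop 2.x-style APIs and hypothetical file names, mirrors the check Hadoop's text input formats perform: a .gz file resolves to GzipCodec, which does not implement SplittableCompressionCodec, so the whole file becomes one split and therefore one mapper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittabilityCheck {
        // Roughly the logic TextInputFormat uses to decide whether a file can be split.
        public static boolean isSplittable(Configuration conf, Path file) {
            CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
            // Uncompressed files are always splittable.
            if (codec == null) {
                return true;
            }
            // GzipCodec does not implement SplittableCompressionCodec (bzip2 does),
            // so a .gz file yields exactly one split.
            return codec instanceof SplittableCompressionCodec;
        }

        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // File names below are placeholders for the example.
            System.out.println(isSplittable(conf, new Path("enwiki-dump.xml.gz")));  // false
            System.out.println(isSplittable(conf, new Path("enwiki-dump.xml.bz2"))); // true
        }
    }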

There are at least three ways around that limitation:


  1. As a preprocessing step: uncompress the file and recompress it using a splittable codec (LZO).
  2. As a preprocessing step: uncompress the file, split it into smaller sets, and recompress them. (See this; a rough sketch of this approach follows below.)
  3. Use this patch for Hadoop (which I wrote) that allows a way around this: Splittable Gzip
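For option 2, here is a rough sketch (my own illustration, with hypothetical file names and an arbitrary chunk size, not code from the answer) of decompressing the dump, cutting it into fixed-size line chunks, and recompressing each chunk with gzip so that each resulting file can be handled by its own mapper:

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class SplitAndRecompress {
        public static void main(String[] args) throws IOException {
            String input = "enwiki-dump.xml.gz";   // hypothetical input file
            long linesPerChunk = 1_000_000;        // arbitrary chunk size

            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(input)), StandardCharsets.UTF_8))) {
                int chunk = 0;
                long lines = 0;
                BufferedWriter out = newChunkWriter(chunk);
                for (String line; (line = in.readLine()) != null; ) {
                    out.write(line);
                    out.newLine();
                    // Roll over to the next gzip-compressed chunk every linesPerChunk lines.
                    if (++lines % linesPerChunk == 0) {
                        out.close();
                        out = newChunkWriter(++chunk);
                    }
                }
                out.close();
            }
        }

        // Opens a new gzip-compressed output file named chunk-00000.xml.gz, chunk-00001.xml.gz, ...
        private static BufferedWriter newChunkWriter(int chunk) throws IOException {
            String name = String.format("chunk-%05d.xml.gz", chunk);
            return new BufferedWriter(new OutputStreamWriter(
                    new GZIPOutputStream(new FileOutputStream(name)), StandardCharsets.UTF_8));
        }
    }

Note that for a Wikipedia XML dump you would want to cut on element boundaries (e.g. whole pages) rather than an arbitrary line count, so that each chunk remains parseable on its own.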

HTH

