Hadoop gzip compressed files
Question
I am new to Hadoop and trying to process the Wikipedia dump. It is a 6.7 GB gzip-compressed XML file. I read that Hadoop supports gzip-compressed files, but that such a file can only be processed by a single mapper, because only one mapper can decompress it. This seems to put a limitation on the processing. Is there an alternative, like decompressing the XML file, splitting it into multiple chunks, and recompressing them with gzip?
I got this from http://researchcomputing.blogspot.com/2008/04/hadoop-and-compressed-files.html
Thanks for your help.
Answer
A file compressed with the GZIP codec cannot be split because of the way this codec works. A single split in Hadoop can only be processed by a single mapper, so a single GZIP file can only be processed by one mapper.
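To see why, this is roughly the check that Hadoop's file-based input formats perform when deciding whether a compressed input can be split. This is only a minimal sketch: the input path is a placeholder, and the SplittableCompressionCodec interface exists only in newer Hadoop versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Placeholder path to the input file
        Path input = new Path("/user/me/enwiki-dump.xml.gz");

        // Look up the codec by file extension, as the input format does
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(input);

        if (codec == null) {
            System.out.println("Uncompressed: the file can be split normally.");
        } else if (codec instanceof SplittableCompressionCodec) {
            System.out.println("Splittable codec: multiple splits and mappers are possible.");
        } else {
            // GzipCodec lands in this branch, so the whole file becomes one split
            System.out.println("Non-splittable codec (e.g. gzip): one split, one mapper.");
        }
    }
}
```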
There are at least three ways of working around that limitation:
- As a preprocessing step: uncompress the file and recompress it using a splittable codec (LZO)
- As a preprocessing step: uncompress the file, split it into smaller sets and recompress them. (See this; a rough sketch of this option follows the list.)
- Use this patch for Hadoop (which I wrote) that allows a way around this: Splittable Gzip
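For the second option, here is a rough, self-contained sketch (plain java.util.zip, no Hadoop dependencies) that reads the large gzip file once and writes it back out as several smaller gzip files. The file names and the lines-per-chunk value are arbitrary placeholders.

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of the split-and-recompress preprocessing step:
// stream the big gzip file and emit many smaller gzip chunks.
public class SplitAndRecompress {
    public static void main(String[] args) throws IOException {
        final long LINES_PER_CHUNK = 5_000_000L; // placeholder chunk size

        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream("enwiki-dump.xml.gz")), "UTF-8"))) {

            String line;
            long lineCount = 0;
            int chunk = 0;
            BufferedWriter out = openChunk(chunk);

            while ((line = in.readLine()) != null) {
                out.write(line);
                out.newLine();
                if (++lineCount % LINES_PER_CHUNK == 0) {
                    out.close();              // finish the current chunk
                    out = openChunk(++chunk); // start the next one
                }
            }
            out.close();
        }
    }

    private static BufferedWriter openChunk(int chunk) throws IOException {
        String name = String.format("enwiki-chunk-%04d.xml.gz", chunk);
        return new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(name)), "UTF-8"));
    }
}
```

Note that cutting on an arbitrary line count will break XML records; for a Wikipedia dump you would want to cut only after a record boundary (for example, a closing page tag) so that every chunk remains processable on its own.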
HTH