Hadoop gzip compressed files


Question

I am new to Hadoop and am trying to process a Wikipedia dump. It is a 6.7 GB gzip-compressed XML file. I read that Hadoop supports gzip-compressed files, but that such a file can only be processed by a single mapper in a job, because only one mapper can decompress it. This seems to limit the processing. Is there an alternative, such as decompressing the XML file, splitting it into multiple chunks, and recompressing them with gzip?

I read about Hadoop and gzip at http://researchcomputing.blogspot.com/2008/04/hadoop-and-compressed-files.html

Thanks for your help.

Recommended answer

A file compressed with the GZIP codec cannot be split because of the way this codec works. A single split in Hadoop can only be processed by a single mapper, so a single GZIP file can only be processed by a single mapper.
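To make the point above concrete, here is a minimal Python sketch (not Hadoop code) showing that a gzip decoder can start at the beginning of a stream but not at an arbitrary offset inside it, which is what a second mapper would need to do. The helper name `can_decode_from` and the sample data are made up for illustration.

```python
import gzip
import zlib


def can_decode_from(blob: bytes, offset: int) -> bool:
    """Return True if a gzip decoder can start reading `blob` at `offset`."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    try:
        decoder.decompress(blob[offset:])
        return True
    except zlib.error:
        # No gzip header at this position, so decoding cannot begin here.
        return False


# Made-up sample data standing in for a large XML dump.
sample = b"".join(b"<page>%d</page>\n" % i for i in range(50_000))
blob = gzip.compress(sample)

print("from start: ", can_decode_from(blob, 0))
print("from middle:", can_decode_from(blob, len(blob) // 2))
```

Decoding from offset 0 should succeed, while decoding from the midpoint should fail, because the compressed stream carries its header only at the start and offers no marker at which a reader could resynchronise.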

There are at least three ways of getting around that limitation:

  1. As a preprocessing step: uncompress the file and recompress it using a splittable codec (LZO).
  2. As a preprocessing step: uncompress the file, split it into smaller sets, and recompress them. (See http://stackoverflow.com/questions/3960651/splitting-gzipped-logfiles-without-storing-the-ungzipped-splits-on-disk)
  3. Use this patch for Hadoop (which I wrote) that allows for a way around this: Splittable Gzip
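As a rough illustration of the second option, the following Python sketch streams a gzip file and rewrites it as several smaller gzip files without ever storing the uncompressed data on disk. It is a sketch under stated assumptions, not the answerer's method: the function name `split_gzip`, the `part-NNNNN.gz` naming, and the `</page>` record marker are hypothetical, and it assumes each record ends on a line containing that marker.

```python
import gzip
import os


def split_gzip(src_path, out_dir, max_bytes, record_end=b"</page>"):
    """Rewrite one gzip file as several smaller gzip files.

    Each part holds roughly `max_bytes` of uncompressed data. A new part is
    started only after a line containing `record_end` (an assumed record
    marker), so that no record straddles two parts.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    written = 0
    with gzip.open(src_path, "rb") as src:
        for line in src:  # decompresses incrementally, line by line
            if out is None:
                path = os.path.join(out_dir, "part-%05d.gz" % len(paths))
                paths.append(path)
                out = gzip.open(path, "wb")
                written = 0
            out.write(line)
            written += len(line)
            # Roll over once the part is large enough and a record just ended.
            if written >= max_bytes and record_end in line:
                out.close()
                out = None
    if out is not None:
        out.close()
    return paths
```

A call such as `split_gzip("dump.xml.gz", "parts", 128 * 1024 * 1024)` (file names hypothetical) would produce a directory of parts that can be given to a job as input, with each gzip part handled by its own mapper. Note that the parts are fragments: apart from the first, they lack the XML header and root element, so the job's record reader has to locate records by their start and end tags rather than parse each part as a complete document.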

HTH
