How to force Hadoop to unzip inputs regardless of their extension?

I'm running map-reduce and my inputs are gzipped, but do not have a .gz (file name) extension.

Normally, when they do have the .gz extension, Hadoop takes care of unzipping them on the fly before passing them to the mapper. However, without the extension it doesn't do so. I can't rename my files, so I need some way of "forcing" Hadoop to unzip them, even though they do not have the .gz extension.

I tried passing the following flags to Hadoop:

step_args = ["-jobconf", "stream.recordreader.compression=gzip",
             "-jobconf", "mapred.output.compress=true",
             "-jobconf", "mapred.output.compression.type=block",
             "-jobconf", "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"]

However, the input to the mapper still arrives compressed. I verified that by printing the mapper's input inside the mapper code:

mapper input: ^_^@%r?T^B??\K??6^R?+F?3^D??b?^R,??!???a?^X?A??n?m?k?3id?o?z[?-?L2yt^P$n?T,^V????^??y^O^R?nno>}^B^E^N-7?^Z?'?I?OF4??-^Z^X4;????f?RH???^Z?Q??4#^W?I?^F??^]?f+???f0d??A??v?A3*????7?x?p??7?Mq?.g??{^FL?g?^Y+?6??I????^V?C??I??$??ESCVd)K??}?Z??j?,3?{ ?}v???j???^??"?.??^L?^?LX^F??p???
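The garbage above is characteristic of gzip data: every gzip stream begins with the magic bytes `0x1f 0x8b` (the leading `^_` in the dump is `0x1f` in caret notation). A minimal sketch for checking this programmatically; `looks_gzipped` is a hypothetical helper, not part of Hadoop or any streaming library:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream

def looks_gzipped(data: bytes) -> bool:
    """Return True if the byte string starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC

# Compress a sample line the way the input files were compressed,
# then inspect the first bytes -- the same bytes the mapper printed above.
raw = b"some log line\n"
print(looks_gzipped(gzip.compress(raw)))  # True: the stream is still compressed
print(looks_gzipped(raw))                 # False: plain text
```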

Any advice on how to unzip on the fly would be greatly appreciated!

Thanks! Gil.

Solution

You need to modify the source of the LineRecordReader class to change how it chooses a compression codec. The default version creates a Hadoop CompressionCodecFactory and calls getCodec, which parses a file path for its extension. You can instead use getCodecByClassName to obtain whichever codec you want.

You'll then need to override your input format class to make it use your new record reader. Details here: http://daynebatten.com/2015/11/override-hadoop-compression-codec-file-extension/
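If patching LineRecordReader is not practical (for example in a streaming job like the one above, where the mapper is a separate script), a mapper-side fallback is to sniff the gzip magic bytes and decompress the input inside the mapper itself. This is a sketch of that workaround, not the record-reader fix from the linked article; `open_maybe_gzipped` is a hypothetical helper:

```python
import gzip
import io

def open_maybe_gzipped(stream):
    """Return a line-iterable binary stream, transparently gunzipping it
    if it starts with the gzip magic bytes (0x1f 0x8b)."""
    if not hasattr(stream, "peek"):      # e.g. an io.BytesIO in a test
        stream = io.BufferedReader(stream)
    if stream.peek(2)[:2] == b"\x1f\x8b":
        return gzip.GzipFile(fileobj=stream)
    return stream

# A streaming mapper would then iterate over
# open_maybe_gzipped(sys.stdin.buffer) instead of sys.stdin.
```

Since gzip is not splittable, each file goes to a single mapper either way, so this loses no parallelism versus the record-reader approach; the mapper just has to read raw bytes rather than pre-decoded lines.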
