Easiest efficient way to zip output of Hadoop MapReduce


Problem description


I can compress mapreduce output to gzip with

"mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"

Will it be straightforward to implement a zip codec for Hadoop? Zip is a container format, but I only need one file per archive, so would it be easy to create a ZipCodec with the CompressionCodec interface?

Or maybe there is an efficient way to convert gz files to zips, since they can use the same deflate algorithm?

Solution

No big deal, you can wrap a java.util.zip.ZipOutputStream.

You can do this by implementing your own codec, which is done by extending org.apache.hadoop.io.compress.DefaultCodec.

In this codec you wrap the Java zip streams by extending org.apache.hadoop.io.compress.CompressorStream and org.apache.hadoop.io.compress.DecompressorStream, respectively.

In the end you have to override the createInputStream and createOutputStream methods and return a new instance of the wrapped streams there.
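
A rough sketch of what that could look like, assuming one ZIP entry per output file. The package, the class name ZipCodec, and the entry name "data" are made up for illustration, and the Compressor/Decompressor-pool overloads of createOutputStream/createInputStream (which DefaultCodec still serves with its zlib streams) are not handled here.

package com.example.compress;  // hypothetical package

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.CompressorStream;
import org.apache.hadoop.io.compress.DecompressorStream;
import org.apache.hadoop.io.compress.DefaultCodec;

// Sketch of a codec that writes each output file as a ZIP archive with a single entry.
public class ZipCodec extends DefaultCodec {

  @Override
  public CompressionOutputStream createOutputStream(OutputStream out) throws IOException {
    return new ZipWrappingOutputStream(new ZipOutputStream(out));
  }

  @Override
  public CompressionInputStream createInputStream(InputStream in) throws IOException {
    return new ZipWrappingInputStream(new ZipInputStream(in));
  }

  @Override
  public String getDefaultExtension() {
    return ".zip";
  }

  // Wraps java.util.zip.ZipOutputStream; everything goes into one entry ("data" is arbitrary).
  private static class ZipWrappingOutputStream extends CompressorStream {
    private final ZipOutputStream zipOut;

    ZipWrappingOutputStream(ZipOutputStream out) throws IOException {
      super(out);                                 // protected constructor: just sets the underlying stream
      this.zipOut = out;
      zipOut.putNextEntry(new ZipEntry("data"));  // single entry per archive
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
      zipOut.write(b, off, len);
    }

    @Override
    public void finish() throws IOException {
      zipOut.closeEntry();   // both calls are idempotent, so the later close() is harmless
      zipOut.finish();
    }

    @Override
    public void resetState() throws IOException {
      // a real implementation would need to start a fresh entry here
    }
  }

  // Wraps java.util.zip.ZipInputStream and reads the first (only) entry.
  private static class ZipWrappingInputStream extends DecompressorStream {
    private final ZipInputStream zipIn;

    ZipWrappingInputStream(ZipInputStream in) throws IOException {
      super(in);              // protected constructor: just sets the underlying stream
      this.zipIn = in;
      zipIn.getNextEntry();   // position the stream at the single entry
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
      return zipIn.read(b, off, len);
    }

    @Override
    public void resetState() throws IOException {
      // not supported in this sketch
    }
  }
}

To actually use such a codec you would still have to put it on the job's classpath and register it, e.g. by adding the class to io.compression.codecs and pointing mapred.output.compression.codec at it.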

Still, it's a bit of coding, and I'm pretty sure there must already be an existing implementation somewhere (I seem to recall one was even in a Hadoop release years ago).
