Best splittable compression for Hadoop input = bz2?


Question


We've realized a bit too late that archiving our files in GZip format for Hadoop processing isn't such a great idea. GZip isn't splittable, and for reference, here are the problems which I won't repeat:

My question is: is BZip2 the best archival compression that will allow a single archive file to be processed in parallel by Hadoop? Gzip is definitely not, and from my reading LZO has some problems.

Solution

BZIP2 is splittable in Hadoop - it provides a very good compression ratio, but in terms of CPU time and performance it does not give optimal results, because compression is very CPU-intensive.
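To get a rough feel for the ratio-versus-CPU trade-off described above, here is a minimal sketch that uses only Python's standard `gzip` and `bz2` modules. The synthetic log-like sample and the helper name `measure` are assumptions made for illustration; real Hadoop inputs and codecs will behave differently, so treat the numbers as indicative only.

```python
import bz2
import gzip
import time


def measure(name, compress, data):
    """Return (name, compressed_size_in_bytes, elapsed_seconds)."""
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    return name, len(out), elapsed


# Synthetic, log-like sample data (an assumption; real inputs will differ).
sample = b"".join(
    b"2024-01-01 12:00:%02d INFO request id=%d status=200\n" % (i % 60, i)
    for i in range(50_000)
)

results = [
    measure("gzip", gzip.compress, sample),
    measure("bz2", bz2.compress, sample),
]

for name, size, seconds in results:
    print(f"{name}: ratio={len(sample) / size:.1f}x time={seconds:.3f}s")
```

Running this on a sample of your own data is a quick way to check whether the extra CPU cost of bz2 is acceptable before committing to it for archival.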

LZO is splittable in Hadoop - with hadoop-lzo you get splittable compressed LZO files. You need external .lzo.index files to be able to process them in parallel. The library provides everything needed to generate these indexes, either locally or in a distributed manner.
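Why an index makes a file splittable can be shown with a small conceptual sketch. It assumes the index is simply a sorted list of byte offsets at which compressed blocks begin; the function name `align_splits` and the sample offsets are hypothetical, and this is not hadoop-lzo's actual code.

```python
from bisect import bisect_left


def align_splits(block_offsets, file_size, target_split_size):
    """Cut [0, file_size) into splits that begin only at block offsets.

    block_offsets must be sorted and target_split_size must be positive.
    """
    splits = []
    start = block_offsets[0]
    while start < file_size:
        # Find the first block boundary at or after the desired split end.
        wanted_end = start + target_split_size
        i = bisect_left(block_offsets, wanted_end)
        end = block_offsets[i] if i < len(block_offsets) else file_size
        splits.append((start, end))
        start = end
    return splits


# Hypothetical block offsets for an 800-byte compressed file.
offsets = [0, 100, 250, 400, 520, 700]
print(align_splits(offsets, file_size=800, target_split_size=300))
```

Every split starts at a block boundary, so each one can be decompressed by a separate task without reading the preceding bytes. Without the index there is no cheap way to find those boundaries, which is why the file would otherwise have to be read by a single task.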

LZ4 is splittable in Hadoop - with hadoop-4mc you get splittable compressed 4mc files. You don't need any external index, and you can generate archives with the provided command-line tool or from Java/C code, inside or outside Hadoop. 4mc makes LZ4 available on Hadoop at any speed/compression-ratio level: from a fast mode reaching 500 MB/s compression speed up to high/ultra modes that provide a higher compression ratio, almost comparable to that of GZIP.
