Very basic question about Hadoop and compressed input files


Question

I have started to look into Hadoop. If my understanding is right, I could process a very big file and it would get split over different nodes. However, if the file is compressed, it cannot be split and would need to be processed by a single node (effectively destroying the advantage of running MapReduce over a cluster of parallel machines).

My question is, assuming the above is correct: is it possible to split a large file manually into fixed-size chunks (or daily chunks), compress them, and then pass a list of compressed input files to a MapReduce job?
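If you go the manual-chunking route, each gzip chunk is read whole by one mapper, so parallelism comes from the number of chunks rather than from splitting a single file. Below is a minimal sketch of such a job using the new MapReduce API; the input/output paths and class names are illustrative assumptions, not taken from the original question.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedChunksJob {
    // Pass-through mapper: Hadoop decompresses each .gz file transparently,
    // inferring the codec from the file extension. One map task per file.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "process compressed daily chunks");
        job.setJarByClass(CompressedChunksJob.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // map-only for this sketch
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // A glob of daily gzip chunks (hypothetical path); each file becomes
        // its own map task because gzip is not splittable.
        FileInputFormat.setInputPaths(job, new Path("/data/logs/2012-*/*.gz"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```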

Answer

Consider using LZO compression. It is splittable, which means a big .lzo file can be processed by many mappers. Bzip2 is also splittable, but it is slow.

Cloudera had an introduction about it. For MapReduce, LZO strikes a good balance between compression ratio and compression/decompression speed.
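To actually get splits out of a single .lzo file, the job needs an LZO-aware input format from the third-party hadoop-lzo library, and the file must be indexed first (for example with that library's DistributedLzoIndexer tool). The sketch below assumes hadoop-lzo is on the classpath; the com.hadoop.* class names come from that library and may differ between versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.hadoop.mapreduce.LzoTextInputFormat; // from hadoop-lzo (assumption)

public class LzoInputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "splittable lzo input");
        job.setJarByClass(LzoInputJob.class);

        // LzoTextInputFormat reads the .index file produced by the LZO indexer
        // and generates multiple splits for one large .lzo file, so many
        // mappers can work on it in parallel.
        job.setInputFormatClass(LzoTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/big.lzo")); // hypothetical path

        // Mapper/reducer setup omitted; the default identity mapper and
        // reducer work with the LongWritable/Text records this format emits.
        FileOutputFormat.setOutputPath(job, new Path("/data/lzo-output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```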
