mapreduce in java - gzip input files


Question



I'm using Java, and I'm trying to write a MapReduce job that will receive as input a folder containing multiple gz files.

I've been looking all over, but all the tutorials I've found explain how to process a simple text file; I haven't found anything that solves my problem.

I've asked around at my workplace, but only got references to Scala, which I'm not familiar with.

Any help would be appreciated.

Solution

Hadoop checks the file extension to detect compressed files. The compression types supported by Hadoop are: gzip, bzip2, and LZO. You do not need to take any additional action to extract files using these types of compression; Hadoop handles it for you.
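If you want to see this extension-based detection for yourself, here is a minimal sketch (the input path is made up for the example) that asks Hadoop's CompressionCodecFactory which codec it would pick for a given file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The factory resolves a codec purely from the file name extension,
        // which is the same lookup the text input format relies on at run time.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(new Path("/data/logs/part-0001.gz"));
        System.out.println(codec == null
                ? "no codec matched - the file would be read as plain text"
                : "detected codec: " + codec.getClass().getSimpleName());
    }
}
```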

So all you have to do is write the logic as you would for a plain text file and pass in the directory containing the .gz files as the input path.
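For illustration, here is a minimal word-count-style sketch (class names and paths are made up for the example). The mapper and reducer are exactly what you would write for uncompressed text; the only thing that points at the gzip data is the input directory passed on the command line:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzipWordCount {

    // The mapper is written exactly as for uncompressed text:
    // each call receives one already-decompressed line from a .gz file.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "gzip word count");
        job.setJarByClass(GzipWordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the directory holding the .gz files; no extra
        // configuration is needed for Hadoop to decompress them.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would run it like any other job, e.g. `hadoop jar wordcount.jar GzipWordCount /data/input-gz /data/output`, where /data/input-gz is the folder of .gz files (paths here are illustrative).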

But the issue with gzip files is that they are not splittable: imagine you have gzip files of 5 GB each; each mapper will then process the whole 5 GB file instead of working with the default block size.

