Providing several non-textual files to a single map in Hadoop MapReduce


Question



I'm currently writing a distributed application that parses PDF files with the help of Hadoop MapReduce. The input to the MapReduce job is thousands of PDF files (mostly ranging from 100 KB to ~2 MB), and the output is a set of parsed text files.

For testing purposes, I initially used the WholeFileInputFormat provided in Tom White's book Hadoop: The Definitive Guide, which feeds a single file to a single map. This worked fine with a small number of input files; however, it does not work properly with thousands of files, because every file gets its own map. Launching a separate map for a task that takes around a second to complete is inefficient.

So what I want to do is submit several PDF files to one map (for example, by combining several files into a single chunk of around the HDFS block size, ~64 MB). I found that CombineFileInputFormat is useful for my case. However, I cannot figure out how to extend that abstract class so that I can process each file and its filename as a single key-value record.

Any help is appreciated. Thanks!
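
To make the batching idea in the question concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that greedily groups files into batches of at most one block. The class name SplitPacking and the method pack are hypothetical and used only for illustration; this is not how CombineFileInputFormat is implemented, only the arithmetic behind "several files per map".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: groups file sizes into batches
// whose total stays at or below a target size (for example one 64 MB block).
final class SplitPacking {

    static List<List<Long>> pack(long[] fileSizes, long maxBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long used = 0;
        for (long size : fileSizes) {
            // Close the current group when adding this file would overflow it.
            if (!current.isEmpty() && used + size > maxBytes) {
                groups.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(size);
            used += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] sizes = new long[100];
        Arrays.fill(sizes, 2 * mb); // 100 files of 2 MB each
        int maps = pack(sizes, 64 * mb).size();
        System.out.println(maps + " map inputs instead of " + sizes.length);
    }
}
```

With files in the 100 KB to 2 MB range, each batch holds dozens of PDFs, so the number of maps drops by one to two orders of magnitude compared with one map per file.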

Solution

I think a SequenceFile will suit your needs here: http://wiki.apache.org/hadoop/SequenceFile

Essentially, you put all your PDFs into a sequence file and the mappers will receive as many PDFs as fit into one HDFS block of the sequence file. When you create the sequence file, you'll set the key to be the PDF filename, and the value will be the binary representation of the PDF.
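
The record layout described above (key = PDF filename, value = raw PDF bytes) can be illustrated with a small plain-Java sketch. It deliberately does not use Hadoop and does not reproduce the real SequenceFile on-disk format; in an actual job, the usual choice would be Text keys and BytesWritable values written through SequenceFile.Writer. The class PdfRecordStream and its methods are hypothetical names for this illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a sequence file: a stream of
// (filename, length, bytes) records. Not the SequenceFile format.
final class PdfRecordStream {

    // Packs every (filename, content) pair into one byte stream.
    static byte[] write(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> entry : files.entrySet()) {
                out.writeUTF(entry.getKey());          // key: the PDF filename
                out.writeInt(entry.getValue().length); // value length in bytes
                out.write(entry.getValue());           // value: raw PDF bytes
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the records back one by one, the way a mapper would see them.
    static Map<String, byte[]> read(byte[] data) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                files.put(name, content);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }
}
```

Each record read back corresponds to one map() call in the real setup: the mapper gets the filename as the key and the whole PDF as the value, and many such records arrive in a single map because they live in the same block of one large file.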
