Hadoop: Provide directory as input to MapReduce job
Problem description
I'm using Cloudera Hadoop. I'm able to run a simple MapReduce program where I provide a file as input; that file contains the names of all the other files to be processed by the mapper function. But I'm stuck at one point. Say the directory looks like this:
/folder1
- file1.txt
- file2.txt
- file3.txt
How can I specify the input path to the MapReduce program as /folder1, so that it can start processing each file inside that directory?
Any ideas?
EDIT:
1) Initially, I provided inputFile.txt as input to the MapReduce program. It was working perfectly:
inputFile.txt
file1.txt
file2.txt
file3.txt
2) But now, instead of giving an input file, I want to provide an input directory as arg[0] on the command line:
hadoop jar ABC.jar /folder1 /output
The problem is that FileInputFormat doesn't read files recursively in the input path directory.
Solution: Use the following code
FileInputFormat.setInputDirRecursive(job, true);
before the line below in your MapReduce code:

FileInputFormat.addInputPath(job, new Path(args[0]));

You can check here for which version it was fixed.
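
For context, here is a minimal driver sketch showing where that call fits. It is only a sketch: the class name FolderInputDriver and the commented-out mapper/reducer hooks are hypothetical placeholders, assuming the new org.apache.hadoop.mapreduce API on Hadoop 2.x.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FolderInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "process-folder1");
        job.setJarByClass(FolderInputDriver.class);

        // Plug in your own mapper, reducer, and output types here, e.g.:
        // job.setMapperClass(MyMapper.class);
        // job.setReducerClass(MyReducer.class);

        // Descend into subdirectories of the input path (the fix above).
        FileInputFormat.setInputDirRecursive(job, true);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /folder1
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With a driver like this, hadoop jar ABC.jar /folder1 /output processes every file under /folder1. Note that addInputPath pointed at a directory already picks up the files sitting directly inside it; setInputDirRecursive(job, true) is what additionally makes nested subdirectories get scanned. In Hadoop 2.x the same switch is exposed as the mapreduce.input.fileinputformat.input.dir.recursive configuration property.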