Hadoop: Provide directory as input to MapReduce job
Question
I'm using Cloudera Hadoop and can run a simple MapReduce program where I provide a single file as input. This file lists all the other files to be processed by the mapper function. But I'm stuck at one point.
/folder1
- file1.txt
- file2.txt
- file3.txt
How can I specify the input path to the MapReduce program as "/folder1", so that it starts processing each file inside that directory?

Any ideas?
1) Initially, I provided inputFile.txt as input to the MapReduce program. It worked perfectly.
>inputFile.txt
file1.txt
file2.txt
file3.txt
2) But now, instead of giving an input file, I want to provide an input directory as arg[0] on the command line:
hadoop jar ABC.jar /folder1 /output
Answer
The problem is that FileInputFormat doesn't read files recursively in the input path directory.

Solution: add the following line

FileInputFormat.setInputDirRecursive(job, true);

before this line in your MapReduce code:

FileInputFormat.addInputPath(job, new Path(args[0]));
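Put together, a minimal driver sketch might look like the following. The class name, job name, and the commented-out mapper/reducer hooks are illustrative placeholders, not from the original post; the two FileInputFormat calls are the ones the answer describes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FolderInputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "folder input example");
        job.setJarByClass(FolderInputDriver.class);

        // Hypothetical mapper/reducer classes -- substitute your own:
        // job.setMapperClass(MyMapper.class);
        // job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Traverse subdirectories under the input path recursively,
        // so args[0] can be a directory such as /folder1.
        FileInputFormat.setInputDirRecursive(job, true);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This would then be invoked as in the question: hadoop jar ABC.jar /folder1 /output. Note that this sketch requires the Hadoop client libraries on the classpath and a configured cluster (or local mode) to actually run.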
You can check here for which version it was fixed.