Hadoop: Provide directory as input to MapReduce job


Problem Description



I'm using Cloudera Hadoop. I'm able to run a simple MapReduce program where I provide a file as input to the MapReduce program.

This file lists all the other files to be processed by the mapper function.

But I'm stuck at one point.

/folder1
  - file1.txt
  - file2.txt
  - file3.txt

How can I specify the input path to the MapReduce program as "/folder1", so that it starts processing each file inside that directory?

Any ideas?

EDIT:

1) Initially, I provided inputFile.txt as the input to the MapReduce program. It was working perfectly.

>inputFile.txt
file1.txt
file2.txt
file3.txt

2) But now, instead of giving an input file, I want to provide an input directory as arg[0] on the command line.

hadoop jar ABC.jar /folder1 /output

Solution

The problem is that FileInputFormat doesn't read files recursively from the input path directory.

Solution: use the following code. The setInputDirRecursive call goes before the addInputPath line in your MapReduce driver code:

FileInputFormat.setInputDirRecursive(job, true);
FileInputFormat.addInputPath(job, new Path(args[0]));

You can check here for the version in which this was fixed.
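
For context, here is a minimal, self-contained driver sketch showing where the setInputDirRecursive call sits relative to addInputPath. The DirInputDriver class name and the identity, map-only setup are assumptions made for illustration, not part of the original post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class; the name is illustrative only.
public class DirInputDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "directory input example");
    job.setJarByClass(DirInputDriver.class);

    // Identity mapper in a map-only job, just to keep the sketch self-contained;
    // substitute your own mapper/reducer classes here.
    job.setMapperClass(Mapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Make FileInputFormat descend into subdirectories of the input path.
    FileInputFormat.setInputDirRecursive(job, true);

    // args[0] = input directory (e.g. /folder1), args[1] = output directory.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same switch is also exposed as a configuration property, mapreduce.input.fileinputformat.input.dir.recursive (the Hadoop 2.x name), so it can be set to true at job submission time, for example via -D if the driver goes through ToolRunner.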
