Hadoop: Provide directory as input to MapReduce job


Problem Description

I'm using Cloudera Hadoop. I'm able to run a simple MapReduce program where I provide a file as input to the MapReduce program.

This file contains the names of all the other files to be processed by the mapper function.

But I'm stuck at one point.

/folder1
  - file1.txt
  - file2.txt
  - file3.txt

How can I specify the input path to the MapReduce program as "/folder1", so that it can start processing each file inside that directory?

Any ideas?

1) Initially, I provided inputFile.txt as input to the MapReduce program. It was working perfectly.

>inputFile.txt
file1.txt
file2.txt
file3.txt
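
As an aside, a mapper for this file-list approach might look like the sketch below. This is a hypothetical reconstruction (the question doesn't show the original mapper): each input value is one line of inputFile.txt, i.e. a file path, which the mapper opens itself through the Hadoop FileSystem API.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: each value is a line of inputFile.txt, i.e. a path.
public class FileListMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Path path = new Path(value.toString());
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        // Open the named file and emit each of its lines.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                context.write(new Text(line), NullWritable.get());
            }
        }
    }
}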

2) But now, instead of giving an input file, I want to provide an input directory as arg[0] on the command line:

hadoop jar ABC.jar /folder1 /output

Recommended Answer

The problem is that FileInputFormat doesn't read files recursively from the input path directory.

Solution: add the following line

FileInputFormat.setInputDirRecursive(job, true);

before this line in your MapReduce code:

FileInputFormat.addInputPath(job, new Path(args[0]));
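
For reference, here is a minimal driver sketch showing where the recursive flag sits relative to the input path. The class name, job name, and output types are placeholders, not from the original question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DirInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "dir-input-example");
        job.setJarByClass(DirInputDriver.class);
        // Set your Mapper/Reducer and output key/value classes here,
        // exactly as in your existing single-file job.

        // Make FileInputFormat descend into subdirectories of the input path.
        FileInputFormat.setInputDirRecursive(job, true);
        // args[0] can now be a directory such as /folder1.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With this in place, hadoop jar ABC.jar /folder1 /output picks up file1.txt, file2.txt and file3.txt. Note that FileInputFormat already accepts a directory of plain files as an input path; setInputDirRecursive matters when /folder1 itself contains subdirectories.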

You can check here for the version in which it was fixed.

