How to get Filename/File Contents as key/value input for MAP when running a Hadoop MapReduce Job?

Views: 53 | Posted: 2021/12/15 19:22:50 | Tags: java, hadoop, mapreduce, distributed-system

Problem description

I am creating a program to analyze PDF, DOC and DOCX files. These files are stored in HDFS.

When I start my MapReduce job, I want the map function to receive the filename as the key and the file's binary contents as the value. I then want to create a stream reader that I can pass to the PDF parser library. How can I make the key/value pair for the map phase be filename/file contents?
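
To make the "stream reader" step concrete, here is a minimal plain-Java sketch of wrapping a value's bytes in an InputStream that a parser could consume. It deliberately uses no Hadoop classes. The class name ValueAsStream, the helper readHeader and the file name example.pdf are hypothetical and only for illustration, and readHeader merely stands in for a real parser call. If the value arrives as a Hadoop BytesWritable, the assumption here is that its backing array may be longer than the valid data, which is why the valid length is passed explicitly.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ValueAsStream {

    // Wraps the first `length` bytes of `buffer` in a stream without copying.
    // The explicit length matters when the buffer is larger than the data it holds.
    public static InputStream toStream(byte[] buffer, int length) {
        return new ByteArrayInputStream(buffer, 0, length);
    }

    // Hypothetical stand-in for a parser call: reads up to 4 bytes as an ASCII header.
    public static String readHeader(InputStream in) throws IOException {
        byte[] header = new byte[4];
        int n = in.read(header, 0, header.length);
        return n <= 0 ? "" : new String(header, 0, n, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        String fileName = "example.pdf";      // hypothetical map key
        byte[] padded = new byte[16];         // simulates a buffer with unused capacity
        byte[] data = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(data, 0, padded, 0, data.length);

        // Only the valid bytes are exposed to the consumer of the stream.
        try (InputStream in = toStream(padded, data.length)) {
            System.out.println(fileName + " starts with " + readHeader(in));
        }
    }
}
```

Inside a real map function, the same toStream call would take the value's byte array and its valid length, and the resulting stream would go to the PDF parser instead of readHeader.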

I am using Hadoop 0.20.2.

This is the older code that starts a job:

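// Imports needed by this snippet (old org.apache.hadoop.mapred API)
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

// Lives inside the PdfReader class, next to the Map and Reduce classes.
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(PdfReader.class);
    conf.setJobName("pdfreader");

    // Types of the key/value pairs the job emits
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    // TextInputFormat feeds the mapper (byte offset, line of text) pairs,
    // not (filename, file contents)
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // args[0] is the input path, args[1] the output path
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}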

I know there are other InputFormat types, but is there one that does exactly what I want? I find the documentation quite vague. If one is available, what should the map function's input types look like?

Thanks in advance!
