Programmatically reading the output of a Hadoop MapReduce program


Problem description

This may be a basic question, but I could not find an answer for it on Google.

I have a map-reduce job that creates multiple output files in its output directory. My Java application executes this job on a remote Hadoop cluster, and after the job is finished it needs to read the output programmatically using the org.apache.hadoop.fs.FileSystem API. Is that possible?

The application knows the output directory, but not the names of the output files generated by the map-reduce job. There seems to be no way to programmatically list the contents of a directory through the Hadoop file system API. How can the output files be read?

This seems such a commonplace scenario that I am sure it has a solution, but I am missing something very obvious.
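
For reference, a FileSystem handle for a remote cluster is typically obtained from its namenode URI; a minimal sketch, where the namenode URI and output directory are placeholders:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    // Placeholder namenode URI and output path; substitute the real cluster
    // address and the directory the job wrote to.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    Path outputDir = new Path("/user/me/job-output");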

Recommended answer

The method you are looking for is called listStatus(Path). It simply returns all files inside a Path as a FileStatus array. You can then loop over them, build a Path object for each one, and read it.

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;

    // fs is an org.apache.hadoop.fs.FileSystem instance and conf its Configuration.
    // List everything under the given directory and read each file as a
    // SequenceFile of IntWritable key/value pairs.
    FileStatus[] fss = fs.listStatus(new Path("/"));
    for (FileStatus status : fss) {
        Path path = status.getPath();
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        IntWritable key = new IntWritable();
        IntWritable value = new IntWritable();
        while (reader.next(key, value)) {
            System.out.println(key.get() + " | " + value.get());
        }
        reader.close();
    }
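
Since the application knows only the output directory and not the file names, it can also help to filter the listing down to the actual data files; MapReduce typically writes a _SUCCESS marker into the output directory alongside the part-* files. A minimal sketch, assuming the same fs as above and a placeholder output path:

    import org.apache.hadoop.fs.PathFilter;

    // Keep only the data files (e.g. part-r-00000) and skip markers such as _SUCCESS.
    FileStatus[] parts = fs.listStatus(new Path("/user/me/job-output"),
            new PathFilter() {
                public boolean accept(Path p) {
                    return p.getName().startsWith("part-");
                }
            });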

For Hadoop 2.x, you can set up the reader like this:

    SequenceFile.Reader reader =
            new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
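
Putting this together, a minimal sketch of the same read loop in the Hadoop 2.x style; here ReflectionUtils is used to instantiate key and value objects from the types recorded in the SequenceFile header rather than hard-coding IntWritable (an assumption, since your output types may differ):

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    for (FileStatus status : fss) {
        SequenceFile.Reader reader =
                new SequenceFile.Reader(conf, SequenceFile.Reader.file(status.getPath()));
        // Create key/value instances of whatever types the file was written with.
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            System.out.println(key + " | " + value);
        }
        reader.close();
    }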
