Programmatically reading the output of Hadoop Mapreduce Program


Question


This may be a basic question, but I could not find an answer for it on Google.
I have a map-reduce job that creates multiple output files in its output directory. My Java application executes this job on a remote Hadoop cluster and, after the job is finished, it needs to read the output programmatically using the org.apache.hadoop.fs.FileSystem API. Is that possible?
The application knows the output directory, but not the names of the output files generated by the map-reduce job. There seems to be no way to programmatically list the contents of a directory through the Hadoop file system API. How can the output files be read?
It seems such a commonplace scenario, that I am sure it has a solution. But I am missing something very obvious.

Answer


The method you are looking for is called listStatus(Path). It returns all files inside a Path as a FileStatus array. You can then loop over them, build a Path object for each entry, and read it.

    // List everything under the given path (the answer uses "/";
    // substitute your job's actual output directory).
    FileStatus[] fss = fs.listStatus(new Path("/"));
    for (FileStatus status : fss) {
        Path path = status.getPath();
        // Read each part file as a SequenceFile of IntWritable key/value pairs.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        IntWritable key = new IntWritable();
        IntWritable value = new IntWritable();
        while (reader.next(key, value)) {
            System.out.println(key.get() + " | " + value.get());
        }
        reader.close();
    }
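If the job writes plain text output (the default TextOutputFormat) rather than SequenceFiles, the same listStatus call applies. A sketch under assumptions: "/output" stands in for your job's actual output directory, `fs` is an already-initialized FileSystem, and each part file is line-oriented UTF-8 text:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Assumes `fs` points at the cluster; "/output" is a placeholder path.
FileStatus[] fss = fs.listStatus(new Path("/output"));
for (FileStatus status : fss) {
    Path path = status.getPath();
    // Skip non-data entries such as the _SUCCESS marker file.
    if (path.getName().startsWith("_")) {
        continue;
    }
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
    }
}
```

The `_`-prefix check matters in practice: MapReduce drops a `_SUCCESS` marker (and sometimes `_logs`) into the output directory, and trying to parse those as data files fails.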


For Hadoop 2.x you can set up the reader like this:

    SequenceFile.Reader reader =
            new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
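Putting the pieces together for Hadoop 2.x, a minimal end-to-end sketch, assuming IntWritable keys and values and a hypothetical "/output" directory (adjust both to match your job):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

public class ReadJobOutput {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // "/output" is a placeholder for the job's actual output directory.
        for (FileStatus status : fs.listStatus(new Path("/output"))) {
            Path path = status.getPath();
            if (path.getName().startsWith("_")) {
                continue; // skip _SUCCESS and similar marker files
            }
            // Hadoop 2.x reader construction via the Option-based API.
            try (SequenceFile.Reader reader = new SequenceFile.Reader(
                    conf, SequenceFile.Reader.file(path))) {
                IntWritable key = new IntWritable();
                IntWritable value = new IntWritable();
                while (reader.next(key, value)) {
                    System.out.println(key.get() + " | " + value.get());
                }
            }
        }
    }
}
```

The try-with-resources block closes each reader even if a read fails, which the original loop-and-close version does not guarantee.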

