Get Total Input Path Count in Hadoop Mapper


Problem


We are trying to grab the total number of input paths our MapReduce program is iterating through in our mapper. We are going to use this along with a counter to format our value depending on the index. Is there an easy way to pull the total input path count from the mapper? Thanks in advance.

Solution

You could look through the source for FileInputFormat.getSplits() - it pulls back the mapred.input.dir configuration property and then resolves that comma-separated list into an array of Paths.
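That CSV-splitting step can be sketched in plain Java. This is a simplification of my own, not Hadoop's actual code: the real getSplits() also un-escapes commas embedded in path names and builds Path objects rather than strings.

```java
import java.util.Arrays;

// Simplified sketch of how getSplits() turns the mapred.input.dir CSV into
// individual path strings. Assumption: no escaped commas inside path names;
// real Hadoop handles those and wraps each entry in a Path object.
public class InputDirParser {
    static String[] splitInputDirs(String mapredInputDir) {
        return Arrays.stream(mapredInputDir.split(","))
                     .map(String::trim)
                     .toArray(String[]::new);
    }
}
```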

These paths can still represent folders and glob patterns, so the next thing getSplits() does is pass the array to the protected method org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(JobContext). This walks the listed dirs/globs and lists the matching files (also invoking a PathFilter if one is configured).
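The list-and-match behaviour can be mimicked on the local filesystem with plain java.nio (a sketch with names of my own; Hadoop's real listStatus() runs against the job's FileSystem, recurses into directories, and honours any configured PathFilter, none of which is reproduced here):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Rough local-filesystem analogue of the listStatus() step described above:
// expand one glob pattern against a directory and collect the matches.
public class GlobLister {
    static List<Path> listMatching(Path dir, String glob) throws IOException {
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                matches.add(p);
            }
        }
        return matches;
    }
}
```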

So, with this method being protected, you could create a simple 'dummy' extension of FileInputFormat that exposes a public listStatus method accepting the Mapper.Context (which is a JobContext) as its argument, and in turn wraps the call to FileInputFormat.listStatus:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class DummyFileInputFormat extends FileInputFormat<Object, Object> {
    // Widens the protected listStatus(JobContext) to public so it can be
    // called from inside a mapper (Mapper.Context is a JobContext)
    @Override
    public List<FileStatus> listStatus(JobContext context) throws IOException {
        return super.listStatus(context);
    }

    @Override
    public RecordReader<Object, Object> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException,
            InterruptedException {
        // dummy input format, so this will never be called
        return null;
    }
}

EDIT: In fact it looks like FileInputFormat already does this for you: it sets the job property mapreduce.input.num.files at the end of the getSplits() method (at least in 1.0.2; it was probably introduced in 0.20.203).
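So in a mapper's setup() you could simply read that property back. The sketch below uses java.util.Properties in place of Hadoop's Configuration so it stands alone; in a real mapper the equivalent call would be context.getConfiguration().getInt("mapreduce.input.num.files", -1). The class and method names here are mine.

```java
import java.util.Properties;

// Sketch of the lookup a mapper's setup() would perform, with
// java.util.Properties standing in for Hadoop's Configuration.
// Assumption: the property name is the one set by getSplits() as
// described above; -1 signals "not set".
public class NumFilesLookup {
    static int totalInputFiles(Properties jobConf) {
        return Integer.parseInt(jobConf.getProperty("mapreduce.input.num.files", "-1"));
    }
}
```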

Here's the JIRA ticket
