Filtering input files using globStatus in MapReduce

Question

I have a lot of input files and I want to process only selected ones, based on the date appended to the end of each file name. I am not sure where I should call the globStatus method to filter out the other files.

I have a custom RecordReader class and I was trying to use globStatus in its next method but it didn't work out.

public boolean next(Text key, Text value) throws IOException {
    Path filePath = fileSplit.getPath();

    if (!processed) {
        key.set(filePath.getName());

        byte[] contents = new byte[(int) fileSplit.getLength()];
        value.clear();
        FileSystem fs = filePath.getFileSystem(conf);
        fs.globStatus(new Path("/*" + date));
        FSDataInputStream in = null;

        try {
            in = fs.open(filePath);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }
    return false;
}

I know it returns a FileStatus array, but how do I use it to filter the files? Can someone please shed some light?

Solution

The globStatus method takes two complementary arguments that let you filter your files. The first is a glob pattern. Sometimes glob patterns are not expressive enough to select specific files, in which case you can also pass a PathFilter as the second argument.

Regarding the glob pattern, the following are supported:

Glob   | Matches
-------------------------------------------------------------------------------------------------------------------
*      | Matches zero or more characters
?      | Matches a single character
[ab]   | Matches a single character in the set {a, b}
[^ab]  | Matches a single character not in the set {a, b}
[a-b]  | Matches a single character in the range [a, b] where a is lexicographically less than or equal to b
[^a-b] | Matches a single character not in the range [a, b] where a is lexicographically less than or equal to b
{a,b}  | Matches either expression a or b
\c     | Matches character c when it is a metacharacter
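
As a rough sketch of how the "*" glob could serve the date-suffix case from the question, the helper below builds the pattern string. DateGlob, forDate, the "/input" directory and the yyyy-MM-dd format are all assumptions made for illustration; none of them come from Hadoop or from the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Hadoop): builds the glob string that a job
// driver could hand to FileSystem.globStatus.
final class DateGlob {
    // Assumed date format yyyy-MM-dd, as in the "2013-01-15" example.
    private static final Pattern DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    private DateGlob() {}

    // Returns the directory, a "*" wildcard and the date joined together.
    static String forDate(String inputDir, String date) {
        // Accepting only digits and dashes also keeps glob metacharacters
        // such as * ? [ ] { } out of the generated pattern.
        if (!DATE.matcher(date).matches()) {
            throw new IllegalArgumentException("Expected yyyy-MM-dd, got: " + date);
        }
        String dir = inputDir.endsWith("/") ? inputDir : inputDir + "/";
        return dir + "*" + date;
    }
}
```

In the driver, the result of DateGlob.forDate("/input", "2013-01-15") could be wrapped in a Path and passed to fs.globStatus, and the matching paths (for instance via FileUtil.stat2Paths) given to FileInputFormat.setInputPaths. Doing this in the driver rather than in the RecordReader matters because, by the time next is called, the file has already been assigned to a split. FileInputFormat also expands glob patterns in input paths itself, so passing the glob path directly may be enough.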

PathFilter is simply an interface like this:

public interface PathFilter {
    boolean accept(Path path);
}

So you can implement this interface and put your file-filtering logic in the accept method.

Here is an example taken from Tom White's excellent book (Hadoop: The Definitive Guide). It defines a PathFilter that excludes paths matching a certain regular expression:

public class RegexExcludePathFilter implements PathFilter {
    private final String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    public boolean accept(Path path) {
        return !path.toString().matches(regex);
    }
}
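
One detail worth checking with this filter is that String.matches only succeeds when the regular expression covers the whole string, and path.toString() usually includes the scheme and directories. The sketch below mirrors the accept logic on plain strings so it can run without Hadoop; ExcludeByRegex is a stand-in name and the "hdfs://namenode/logs" path is made up.

```java
// Stand-in for the accept() logic of RegexExcludePathFilter, using a plain
// String instead of org.apache.hadoop.fs.Path so it runs without Hadoop.
final class ExcludeByRegex {
    private ExcludeByRegex() {}

    // Mirrors "return !path.toString().matches(regex)".
    static boolean accept(String pathString, String regex) {
        // String.matches succeeds only if the regex matches the whole string.
        return !pathString.matches(regex);
    }
}
```

With the path "hdfs://namenode/logs/part-2013-01-15", the regex "2013-01-15" excludes nothing because it does not cover the full path string, whereas ".*2013-01-15" excludes the file as intended.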

You can directly filter your input with a PathFilter implementation by calling FileInputFormat.setInputPathFilter(JobConf, RegexExcludePathFilter.class) when initializing your job.

EDIT: Since setInputPathFilter takes a class, you can't pass constructor arguments directly, but you should be able to get a similar effect through the Configuration. If you make your filter also extend Configured, it receives the Configuration object, which you will have populated beforehand with the desired values, so you can read those values inside the filter and use them in accept.

For example, if you initialize the configuration like this:

conf.set("date", "2013-01-15");

Then you can define your filter like this:

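import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexIncludePathFilter extends Configured implements PathFilter {
    private String date;
    private FileSystem fs;

    @Override
    public boolean accept(Path path) {
        try {
            // Always accept directories so that their contents get filtered too.
            if (fs.isDirectory(path)) {
                return true;
            }
        } catch (IOException e) {
            // Could not determine the path type; fall through to the name check.
        }
        return path.toString().endsWith(date);
    }

    @Override
    public void setConf(Configuration conf) {
        super.setConf(conf); // keep getConf() working
        if (null != conf) {
            this.date = conf.get("date");
            try {
                this.fs = FileSystem.get(conf);
            } catch (IOException e) {
                // Fail early instead of hitting a null fs later in accept.
                throw new RuntimeException("Cannot access the file system", e);
            }
        }
    }
}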

EDIT 2: There were a few issues with the original code; please see the updated class above. The constructor has to be removed since it is no longer used, and accept has to check whether the path is a directory, in which case it should return true so that the contents of the directory can be filtered too.
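
To illustrate why the filter needs a no-argument constructor and gets its values through setConf, here is a minimal Hadoop-free sketch of the pattern. As far as I know, Hadoop does something similar through ReflectionUtils.newInstance, but everything below is a stand-in: NameFilter, ConfAware, DateSuffixFilter and FilterFactory are hypothetical names, and a Map plays the role of Configuration.

```java
import java.util.Map;

// Stand-ins (assumptions, not Hadoop types): a Map plays the role of
// Configuration, and these two interfaces play PathFilter and Configurable.
interface NameFilter { boolean accept(String path); }
interface ConfAware { void setConf(Map<String, String> conf); }

final class DateSuffixFilter implements NameFilter, ConfAware {
    private String date;

    // No explicit constructor: the implicit no-argument one is what the
    // reflective factory below relies on.
    @Override
    public void setConf(Map<String, String> conf) {
        this.date = conf.get("date");
    }

    @Override
    public boolean accept(String path) {
        return path.endsWith(date);
    }
}

final class FilterFactory {
    private FilterFactory() {}

    // Creates the filter from its class alone, then injects the configuration.
    static NameFilter newInstance(Class<? extends NameFilter> type, Map<String, String> conf) {
        try {
            NameFilter filter = type.getDeclaredConstructor().newInstance();
            if (filter instanceof ConfAware) {
                ((ConfAware) filter).setConf(conf);
            }
            return filter;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Filter needs a no-argument constructor", e);
        }
    }
}
```

Because the factory only receives the class, a constructor that takes a regex or a date could never be called; the value has to travel through the configuration instead, which is what conf.set("date", ...) achieves in the real job.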
