Wildcard in Hadoop's FileSystem listing API calls


tl;dr: To be able to use wildcards (globs) in the listed paths, one simply has to use globStatus(...) instead of listStatus(...).


Context

Files on my HDFS cluster are organized in partitions, with the date being the "root" partition. A simplified example of the files structure would look like this:

/schemas_folder
├── date=20140101
│   ├── A-schema.avsc
│   ├── B-schema.avsc
├── date=20140102
│   ├── A-schema.avsc
│   ├── B-schema.avsc
│   ├── C-schema.avsc
└── date=20140103
    ├── B-schema.avsc
    └── C-schema.avsc

In my case, the directory stores Avro schemas for different types of data (A, B and C in this example) at different dates. A schema may start existing, evolve, and stop existing as time passes.


Goal

I need to be able to get all the schemas that exist for a given type, as quickly as possible. In the example where I would like to get all the schemas that exist for type A, I would like to do the following:

hdfs dfs -ls /schemas_folder/date=*/A-schema.avsc

That would give me

Found 1 items
-rw-r--r--   3 user group 1234 2014-01-01 12:34 /schemas_folder/date=20140101/A-schema.avsc
Found 1 items
-rw-r--r--   3 user group 2345 2014-01-02 23:45 /schemas_folder/date=20140102/A-schema.avsc


Problem

I don't want to be using the shell command, and cannot seem to find the equivalent to that command above in the Java APIs. When I try to implement the looping myself, I get terrible performance. I want at least the performance of the command line (around 3 seconds in my case)...


What I found so far

One can notice that it prints Found 1 items twice, once before each result, rather than Found 2 items once at the beginning. That probably hints that wildcards are not implemented on the FileSystem side but are somehow handled by the client. I can't seem to find the right source code to look at to see how that is implemented.
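That hunch is right: glob expansion happens on the client side (in recent Hadoop sources it is factored into the org.apache.hadoop.fs.Globber helper, which FileSystem#globStatus delegates to). As a self-contained illustration of the matching semantics, java.nio's glob matcher follows very similar rules; note in particular that * does not cross / boundaries:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
    public static void main(String[] args) {
        // Same glob as the hdfs dfs -ls command above; '*' matches within
        // a single path segment only, so it selects exactly one date level.
        PathMatcher matcher = FileSystems.getDefault()
                .getPathMatcher("glob:/schemas_folder/date=*/A-schema.avsc");

        System.out.println(matcher.matches(
                Paths.get("/schemas_folder/date=20140101/A-schema.avsc"))); // true
        System.out.println(matcher.matches(
                Paths.get("/schemas_folder/date=20140101/B-schema.avsc"))); // false
    }
}
```

Roughly speaking, each wildcard segment costs the client one directory listing at that level, which is why a hand-rolled full recursive listing is so much slower.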

Below are my first shots, probably a bit too naïve...

Using listFiles(...)

Code:

// Recursively list every file under /schemas_folder, then filter client-side.
RemoteIterator<LocatedFileStatus> files = filesystem.listFiles(new Path("/schemas_folder"), true);
Pattern pattern = Pattern.compile("^.*/date=[0-9]{8}/A-schema\\.avsc$");
while (files.hasNext()) {
    Path path = files.next().getPath();
    // Keep only A-schema files sitting directly under a date=YYYYMMDD partition.
    if (pattern.matcher(path.toString()).matches())
    {
        System.out.println(path);
    }
}

Result:

This prints exactly what I would expect, but since it first lists everything recursively and then filters, the performance is really poor. With my current dataset, it takes almost 25 seconds...

Using listStatus(...)

Code:

// Pass 1: list only the date=YYYYMMDD partition directories under /schemas_folder.
FileStatus[] statuses = filesystem.listStatus(new Path("/schemas_folder"), new PathFilter()
{
    private final Pattern pattern = Pattern.compile("^date=[0-9]{8}$");

    @Override
    public boolean accept(Path path)
    {
        return pattern.matcher(path.getName()).matches();
    }
});
Path[] paths = new Path[statuses.length];
for (int i = 0; i < statuses.length; i++) { paths[i] = statuses[i].getPath(); }
// Pass 2: list all partitions in a single call, keeping only A-schema.avsc entries.
statuses = filesystem.listStatus(paths, new PathFilter()
{
    @Override
    public boolean accept(Path path)
    {
        return "A-schema.avsc".equals(path.getName());
    }
});
for (FileStatus status : statuses)
{
    System.out.println(status.getPath());
}

Result:

Thanks to the PathFilters and the use of arrays, this version performs faster (around 12 seconds). The code is more complex, though, and harder to adapt to different situations. Most importantly, it is still 3 to 4 times slower than the command-line version!


Question

What am I missing here? What is the fastest way to get the results I want?


Updates

2014.07.09 - 13:38

The answer proposed by Mukesh S is apparently the best possible API approach.

In the example I gave above, the code ends up looking like this:

FileStatus[] statuses = filesystem.globStatus(new Path("/schemas_folder/date=*/A-schema.avsc"));
for (FileStatus status : statuses)
{
    System.out.println(status.getPath());
}

This is the best-looking and best-performing code I have come up with so far, but it still does not perform as well as the shell version.

Solution

Instead of listStatus, you can try Hadoop's globStatus. Hadoop provides two FileSystem methods for processing globs:

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

An optional PathFilter can be specified to restrict the matches further.
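To illustrate the second form: the filter's accept(Path) is just a predicate applied to each path the glob matched. The restriction logic can be sketched without a cluster using plain regex matching; the "January 2014 only" requirement below is purely hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class FilterDemo {
    public static void main(String[] args) {
        // Paths a glob like /schemas_folder/date=*/A-schema.avsc might return.
        List<String> globbed = Arrays.asList(
                "/schemas_folder/date=20140101/A-schema.avsc",
                "/schemas_folder/date=20140102/A-schema.avsc",
                "/schemas_folder/date=20140215/A-schema.avsc");

        // The same predicate a PathFilter.accept(Path) would apply, here
        // restricting the matches to January 2014 (hypothetical requirement).
        Pattern january = Pattern.compile(".*/date=201401[0-9]{2}/.*");
        List<String> kept = globbed.stream()
                .filter(p -> january.matcher(p).matches())
                .collect(Collectors.toList());

        kept.forEach(System.out::println); // prints the two date=201401xx paths
    }
}
```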

For more description, you can check Hadoop: The Definitive Guide.

Hope it helps!
