How to list all files in a directory and its subdirectories in Hadoop HDFS


Question

I have a folder in HDFS with two subfolders; each of those has about 30 subfolders, each of which, finally, contains XML files. I want to list all the XML files given only the main folder's path. Locally I can do this with Apache commons-io's FileUtils.listFiles(). I have tried this:

FileStatus[] status = fs.listStatus(new Path(args[0]));

but it only lists the first two subfolders and doesn't go any deeper. Is there any way to do this in Hadoop?

Answer

You'll need to use the FileSystem object and perform some logic on the resultant FileStatus objects to manually recurse into the subdirectories.
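
As a minimal sketch of that recursion (assuming Hadoop 2.x, where FileStatus.isDirectory() is available; on older releases the equivalent call is the since-deprecated isDir()):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class XmlLister {

    // Walk the tree rooted at dir, collecting every file whose name ends in ".xml".
    private static void collectXmlFiles(FileSystem fs, Path dir, List<Path> result)
            throws IOException {
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDirectory()) {
                // recursive step: descend into the subdirectory
                collectXmlFiles(fs, status.getPath(), result);
            } else if (status.getPath().getName().endsWith(".xml")) {
                result.add(status.getPath());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        List<Path> xmlFiles = new ArrayList<Path>();
        collectXmlFiles(fs, new Path(args[0]), xmlFiles);
        for (Path p : xmlFiles) {
            System.out.println(p);
        }
    }
}

On recent Hadoop versions you can also avoid the manual recursion entirely: FileSystem.listFiles(path, true) returns a RemoteIterator<LocatedFileStatus> over all files beneath the path.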

You can also apply a PathFilter to return only the XML files, using the listStatus(Path, PathFilter) method.
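
A sketch of such a filter (reusing the fs and args from the listing above):

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Accept only paths whose final component ends in ".xml".
PathFilter xmlFilter = new PathFilter() {
    public boolean accept(Path path) {
        return path.getName().endsWith(".xml");
    }
};

FileStatus[] xmlOnly = fs.listStatus(new Path(args[0]), xmlFilter);

Note that listStatus applies the filter only to the direct children of the given directory, so if you combine it with the recursion above, the filter must also accept directories or the walk will stop at the top level.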

The Hadoop FsShell class has an example of this for the hadoop fs -lsr command, which is a recursive ls; see the source, around line 590 (the recursive step is triggered on line 635).

