How to Process multiple folders in HADOOP


Problem description

I'm having the following problem. I have 200k XML files in HDFS: 200 folders, each containing 2000 XML files. The layout is below:

RootFolder 
   Folder001
       1.xml
       2.xml
       2000.xml
   Folder002
       2001.xml

I need to write a mapper program to read the files and do some XPath processing.

If I give RootFolder as the input path, then each mapper should read one folder and process its XML files.

That is, there should be 200 tasks, and each folder should be read by a single mapper.

How to process multiple folders?

Solution

From my understanding you have 2 problems:

1: All the files in a subfolder need to be handled by a single map task:

Ans: You can make use of CombineFileInputFormat for this scenario. It groups the files accepted by a specified PathFilter (in your case, the filter should accept files from the same folder) and assigns them to a single map task, i.e. you can get roughly one map task per folder. For better control, extend CombineFileInputFormat and make it your own; that is what I did in my case. A sketch follows.
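Here is a minimal sketch of that idea against the new (org.apache.hadoop.mapreduce) API. It assumes listStatus already returns the individual XML files (enable the recursive-input setting from answer 2 below, or add each Folder as an input path). The class name is mine, and WholeFileRecordReader is only a placeholder for whatever per-file XML record reader you use; the rest is the actual Hadoop API.

import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class PerFolderCombineInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        // One pool per parent folder: files from different folders are never
        // combined into the same split, so each folder gets its own map task(s).
        Set<Path> folders = new HashSet<Path>();
        for (FileStatus file : listStatus(job)) {
            folders.add(file.getPath().getParent());
        }
        for (final Path folder : folders) {
            createPool(new PathFilter() {
                @Override
                public boolean accept(Path path) {
                    return folder.equals(path.getParent());
                }
            });
        }
        return super.getSplits(job);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // WholeFileRecordReader is a placeholder: a RecordReader that hands each
        // XML file to the mapper as one record (needed for the XPath step).
        return new CombineFileRecordReader<LongWritable, Text>(
                (CombineFileSplit) split, context, WholeFileRecordReader.class);
    }
}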

2: Need to include the files inside the subfolders as input for your map task(s), while specifying only the root folder.

Ans: In the newer API releases, FileInputFormat can pick up files recursively from subfolders down to any level. For more info you can see the JIRA here.
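For example, with the Hadoop 2.x API this is just a configuration switch when the job is set up. A minimal sketch; the job name and input path are placeholders, and on older releases the property was called mapred.input.dir.recursive instead:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class RecursiveInputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell FileInputFormat to descend into subfolders of the input path.
        conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true);

        Job job = Job.getInstance(conf, "xpath-over-xml");
        FileInputFormat.addInputPath(job, new Path("/RootFolder"));
        // Recent releases also expose a typed setter for the same switch:
        // FileInputFormat.setInputDirRecursive(job, true);

        // ... set mapper class, input/output formats and output path, then submit ...
    }
}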

Or, if you want to do it yourself, subclass FileInputFormat and override the listStatus method, as sketched below.
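A rough sketch of that route, again with the new API: the class name is mine, and the LineRecordReader at the bottom is only a stand-in for the record reader you actually use to feed XML files to the mapper.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class RecursiveXmlInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected List<FileStatus> listStatus(JobContext job) throws IOException {
        // Start from what the default implementation finds directly under the
        // input paths, then descend into any directories among those entries.
        List<FileStatus> files = new ArrayList<FileStatus>();
        for (FileStatus status : super.listStatus(job)) {
            FileSystem fs = status.getPath().getFileSystem(job.getConfiguration());
            addRecursively(fs, status, files);
        }
        return files;
    }

    private void addRecursively(FileSystem fs, FileStatus status, List<FileStatus> files)
            throws IOException {
        if (status.isDirectory()) {
            for (FileStatus child : fs.listStatus(status.getPath())) {
                addRecursively(fs, child, files);
            }
        } else {
            files.add(status);
        }
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        // Stand-in: plug in the reader that hands whole XML files to your mapper.
        return new LineRecordReader();
    }
}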
