Pass directories not files to hadoop-streaming?


Problem Description


In my job, I have the need to parse many historical logsets. Individual customers (there are thousands) may have hundreds of log subdirectories broken out by date. For example:

  • logs/Customer_One/2011-01-02-001
  • logs/Customer_One/2012-02-03-001
  • logs/Customer_One/2012-02-03-002
  • logs/Customer_Two/2009-03-03-001
  • logs/Customer_Two/2009-03-03-002

Each individual log set may itself be five or six levels deep and contain thousands of files.

Therefore, I actually want the individual map jobs to handle walking the subdirectories: simply enumerating individual files is part of my distributed computing problem!

Unfortunately, when I try passing a directory containing only log subdirectories to Hadoop, it complains that I can't pass those subdirectories to my mapper. (Again, my mapper has been written to accept subdirectories as input):

$ hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" -input file:///mnt/logs/Customer_Name/ -file mapper.sh -mapper "mapper.sh" -file reducer.sh -reducer "reducer.sh" -output .

[ . . . ]

12/04/10 12:48:35 ERROR security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:java.io.IOException: Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
12/04/10 12:48:35 ERROR streaming.StreamJob: Error Launching job : Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
Streaming Command Failed!
[cloudera@localhost ~]$

Is there a straightforward way to convince Hadoop-streaming to permit me to assign directories as work items?

Solution

I guess you need to investigate writing a custom InputFormat to which you can pass the root directory; it would create a split for each customer, and the record reader for each split would then do the directory walk and push the file contents to your mappers.
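As a rough illustration of that idea, here is a minimal sketch of such an InputFormat, assuming the old org.apache.hadoop.mapred API that streaming's -inputformat option works with. The class and names below are illustrative, not code from the original answer: it makes one split per customer directory under the input root, and the record reader walks that directory recursively, emitting one record per file with the file path as the key and the file contents as the value.

// Minimal sketch (not production code): one split per customer directory,
// record reader walks the directory and emits (file path, file contents).
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class CustomerDirInputFormat extends FileInputFormat<Text, Text> {

  // One split per customer directory directly under the input root,
  // e.g. logs/Customer_One, logs/Customer_Two, ...
  @Override
  public InputSplit[] getSplits(JobConf conf, int numSplits) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (Path root : getInputPaths(conf)) {
      FileSystem fs = root.getFileSystem(conf);
      for (FileStatus customer : fs.listStatus(root)) {
        if (customer.isDir()) {
          // Length and host hints are placeholders; locality is ignored here.
          splits.add(new FileSplit(customer.getPath(), 0, customer.getLen(), new String[0]));
        }
      }
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }

  @Override
  public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf conf,
      Reporter reporter) throws IOException {
    return new CustomerDirRecordReader((FileSplit) split, conf);
  }

  // Walks one customer directory and returns one (path, contents) record per file.
  public static class CustomerDirRecordReader implements RecordReader<Text, Text> {
    private final FileSystem fs;
    private final List<Path> files = new ArrayList<Path>();
    private int pos = 0;

    CustomerDirRecordReader(FileSplit split, JobConf conf) throws IOException {
      fs = split.getPath().getFileSystem(conf);
      collect(split.getPath());
    }

    // The recursive directory walk the question wants done on the map side.
    private void collect(Path dir) throws IOException {
      for (FileStatus st : fs.listStatus(dir)) {
        if (st.isDir()) {
          collect(st.getPath());
        } else {
          files.add(st.getPath());
        }
      }
    }

    @Override
    public boolean next(Text key, Text value) throws IOException {
      if (pos >= files.size()) {
        return false;
      }
      Path file = files.get(pos++);
      key.set(file.toString());
      // Read the whole file; fine for small log files, unsuitable for huge ones.
      byte[] data = new byte[(int) fs.getFileStatus(file).getLen()];
      FSDataInputStream in = fs.open(file);
      try {
        in.readFully(0, data);
      } finally {
        IOUtils.closeStream(in);
      }
      value.set(data);
      return true;
    }

    @Override public Text createKey() { return new Text(); }
    @Override public Text createValue() { return new Text(); }
    @Override public long getPos() { return pos; }
    @Override public float getProgress() { return files.isEmpty() ? 1.0f : (float) pos / files.size(); }
    @Override public void close() { }
  }
}

With streaming, the compiled class would be shipped to the cluster (for example via -libjars) and named with -inputformat, though the exact wiring depends on the Hadoop version. Also note that streaming's default text protocol writes each record as one key/value line on the mapper's stdin, so multi-line file contents get split across lines; emitting only the file path and letting mapper.sh open the file itself may be the more practical variant.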
