Hadoop Streaming: Mapper 'wrapping' a binary executable


Problem Description


I have a pipeline that I currently run on a large university computer cluster. For publication purposes I'd like to convert it into MapReduce format such that it could be run by anyone using a Hadoop cluster such as Amazon Web Services (AWS). The pipeline currently consists of a series of Python scripts that wrap different binary executables and manage the input and output using the Python subprocess and tempfile modules. Unfortunately I didn't write the binary executables, and many of them either don't take STDIN or don't emit STDOUT in a 'usable' fashion (e.g., they only send it to files). These problems are why I've wrapped most of them in Python.


So far I've been able to modify my Python code such that I have a mapper and a reducer that I can run on my local machine in the standard 'test format':

$ cat data.txt | mapper.py | reducer.py


The mapper formats each line of data the way the binary it wraps wants it, sends the text to the binary using subprocess.Popen (this also allows me to mask a lot of spurious STDOUT), then collects the STDOUT I want and formats it into lines of text appropriate for the reducer. The problems arise when I try to replicate the command on a local Hadoop install. I can get the mapper to execute, but it gives an error that suggests it can't find the binary executable:


File "/Users/me/Desktop/hadoop-0.21.0/./phyml.py", line 69, in <module>
    main()
File "/Users/me/Desktop/hadoop-0.21.0/./mapper.py", line 66, in main
    phyml(None)
File "/Users/me/Desktop/hadoop-0.21.0/./mapper.py", line 46, in phyml
    ft = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE)
File "/Library/Frameworks/Python.framework/Versions/6.1/lib/python2.6/subprocess.py", line 621, in __init__
    errread, errwrite)
File "/Library/Frameworks/Python.framework/Versions/6.1/lib/python2.6/subprocess.py", line 1126, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied
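Incidentally, `OSError: [Errno 13] Permission denied` from Popen usually means the file was found but is not marked executable; files shipped to a task's working directory can end up without the executable bit. A hedged workaround sketch (the name `./binary` is a placeholder, not the poster's actual executable) is to restore the bit before launching the process:

```python
import os
import stat

def ensure_executable(path):
    """Add the executable bits to a file, e.g. a binary shipped alongside the mapper."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

# "./binary" is a hypothetical name for the shipped executable
if os.path.exists("./binary"):
    ensure_executable("./binary")
```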


My hadoop command looks like the following:

./bin/hadoop jar /Users/me/Desktop/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
-input /Users/me/Desktop/Code/AWS/temp/data.txt \
-output /Users/me/Desktop/aws_test \
-mapper mapper.py \
-reducer reducer.py \
-file /Users/me/Desktop/Code/AWS/temp/mapper.py \
-file /Users/me/Desktop/Code/AWS/temp/reducer.py \
-file /Users/me/Desktop/Code/AWS/temp/binary


As I noted above, it looks to me like the mapper isn't aware of the binary - perhaps it's not being sent to the compute node? Unfortunately I can't really tell what the problem is. Any help would be greatly appreciated. It would be particularly nice to see some Hadoop Streaming mappers/reducers written in Python that wrap binary executables. I can't imagine I'm the first one to try to do this! In fact, here is another post asking essentially the same question, but it hasn't been answered yet...

Hadoop/Elastic Map Reduce with binary executable? (http://stackoverflow.com/questions/4101815/hadoop-elastic-map-reduce-with-binary-executable)

Recommended Answer


After much googling (etc.) I figured out how to include executable binaries/scripts/modules so that they are accessible to your mappers/reducers. The trick is to upload all your files to Hadoop first.

$ bin/hadoop dfs -copyFromLocal /local/file/system/module.py module.py


Then you need to format your streaming command like the following template:

$ ./bin/hadoop jar /local/file/system/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
-file /local/file/system/data/data.txt \
-file /local/file/system/mapper.py \
-file /local/file/system/reducer.py \
-cacheFile hdfs://localhost:9000/user/you/module.py#module.py \
-input data.txt \
-output output/ \
-mapper mapper.py \
-reducer reducer.py \
-verbose


If you're linking a Python module you'll need to add the following code to your mapper/reducer scripts:

import sys 
sys.path.append('.')
import module


If you're accessing a binary via subprocess, your command should look something like this:

import shlex
from subprocess import Popen, PIPE

cli = "./binary %s" % (argument)
cli_parts = shlex.split(cli)
mp = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE)
mp.communicate()[0]
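Putting these pieces together, a hypothetical streaming mapper along these lines might look like the sketch below. The binary name (`./binary`) and the key/value output format are placeholder assumptions, not the poster's actual code:

```python
#!/usr/bin/env python
import sys
import shlex
from subprocess import Popen, PIPE

def run_binary(line, binary="./binary"):
    """Send one formatted input line to the wrapped binary and return its STDOUT.
    The default binary path is a placeholder for the shipped executable."""
    cli_parts = shlex.split("%s %s" % (binary, line.strip()))
    mp = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE)
    out = mp.communicate()[0]  # stderr is captured and dropped to mask noise
    return out.decode() if isinstance(out, bytes) else out

def main():
    for line in sys.stdin:
        result = run_binary(line)
        # emit key<TAB>value lines for the reducer
        sys.stdout.write("%s\t%s\n" % (line.strip(), result.strip()))

if __name__ == "__main__":
    main()
```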

Hope this helps.

