hadoop streaming: importing modules on EMR


Problem description

This previous question addressed how to import modules such as nltk for hadoop streaming.

The steps outlined were:

zip -r nltkandyaml.zip nltk yaml
mv nltkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod

You can now import the nltk module for use in your Python script:

import zipimport

importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('yaml')
nltk = importer.load_module('nltk')
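
Put together, a complete streaming mapper built around those three lines might look like the sketch below. The token-counting logic is illustrative only (a real mapper would call into nltk or yaml), and it assumes nltkandyaml.mod is shipped in the job's working directory:

#!/usr/bin/env python
# Streaming mapper sketch: load yaml and nltk from the zipped archive
# shipped alongside this script, then process stdin line by line.
import sys
import zipimport

importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('yaml')
nltk = importer.load_module('nltk')

for line in sys.stdin:
    # Illustrative logic: emit one (token, 1) pair per whitespace token.
    # A real mapper would use nltk/yaml here instead.
    for token in line.split():
        print('%s\t1' % token)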

I have a job that I want to run on Amazon's EMR, and I'm not sure where to put the zipped files. Do I need to create a bootstrapping script under the bootstrapping options, or should I put the tar.gz in S3 and pass it in the extra args? I'm pretty new to all this and would appreciate an answer that walks me through the whole process.

Solution

You have the following options:

  1. Create a bootstrap action script and place it on S3. This script would download the module in whatever format you prefer and place it somewhere your mapper/reducer can access it. To find out exactly where to put the files, start the cluster in such a way that it does not shut down after completion, ssh into it, and examine the directory structure. (A sketch of such a script appears below.)

  2. Use mrjob to launch your jobflows. When starting a job with mrjob, it is possible to specify bootstrap_python_packages, which mrjob will install automatically by uncompressing the .tar.gz and running setup.py install.

http://packages.python.org/mrjob/configs-runners.html
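
For option 1, a bootstrap action can be as small as the following sketch. It is hypothetical: the bucket name, archive path, and destination directory are placeholders, and it assumes the hadoop CLI on the EMR node can read s3:// paths (as EMR's Hadoop builds normally can):

#!/usr/bin/env python
# Hypothetical bootstrap action for option 1: copy the zipped modules
# from S3 to a local directory the mapper/reducer tasks can reach.
import subprocess

ARCHIVE = 's3://my-bucket/bootstrap/nltkandyaml.mod'  # placeholder bucket/path
DEST = '/home/hadoop/'  # placeholder; verify by ssh-ing into the cluster

subprocess.check_call(['hadoop', 'fs', '-copyToLocal', ARCHIVE, DEST])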

I would prefer option 2 because mrjob also helps a lot in developing MapReduce jobs in Python. In particular, it allows running jobs locally (with or without Hadoop) as well as on EMR, which simplifies debugging.
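
As a rough illustration (not from the original answer), a minimal mrjob job looks like the sketch below; with bootstrap_python_packages pointing at your .tar.gz in your mrjob config, the same script runs locally or on EMR:

# mr_word_count.py -- minimal mrjob job sketch (hypothetical example)
from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # One (word, 1) pair per whitespace-separated word.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()

Run it locally with python mr_word_count.py input.txt, or on EMR with python mr_word_count.py -r emr input.txt.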
