Copying/using Python files from S3 to Amazon Elastic MapReduce at bootstrap time

Question

I've figured out how to install Python packages (NumPy and such) at the bootstrapping step using boto, as well as how to copy files from S3 to my EC2 instances, still with boto.
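
(For reference, a minimal sketch of what such a bootstrap-time install can look like with boto's classic EMR API; the region, bucket and script path are placeholders and not part of the original question.)

from boto.emr import connect_to_region
from boto.emr.bootstrap_action import BootstrapAction

conn = connect_to_region('us-east-1')  # placeholder region

# Bootstrap action pointing at a (placeholder) shell script in S3 that
# installs numpy on every node before the job flow starts.
install_packages = BootstrapAction(
    'Install Python packages',                        # name shown in the EMR console
    's3://yourBucket/bootstrap/install_packages.sh',  # placeholder script path in S3
    [],                                               # arguments passed to the script
)

jobid = conn.run_jobflow(
    name='my-streaming-job',
    log_uri='s3://yourBucket/logs/',
    bootstrap_actions=[install_packages],
)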

What I haven't figured out is how to distribute Python scripts (or any file) from S3 buckets to each EMR instance using boto. Any pointers?

Answer

If you are using boto, I recommend packaging all your Python files in an archive (.tar.gz format) and then using the cacheArchive directive in Hadoop/EMR to access it.

This is what I would do:

  1. Put all the required Python files in a sub-directory (e.g. "required/") and test them locally.
  2. Create an archive of it: cd required && tar czvf required.tgz *
  3. Upload this archive to S3: s3cmd put required.tgz s3://yourBucket/required.tgz (a boto alternative is sketched just after this list)
  4. Add this command-line option to your step: -cacheArchive s3://yourBucket/required.tgz#required
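
Since everything else here uses boto, the upload in step 3 can also be done from Python rather than s3cmd. A minimal sketch, assuming the bucket yourBucket already exists and boto can find your AWS credentials:

import boto

# Upload the local required.tgz archive to s3://yourBucket/required.tgz.
s3 = boto.connect_s3()
bucket = s3.get_bucket('yourBucket')
key = bucket.new_key('required.tgz')
key.set_contents_from_filename('required.tgz')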

The last step will ensure that your archive file containing the Python code will be in the same directory layout as on your local dev machine.

To actually do step #4 in boto, here is the code:

from boto.emr.step import StreamingStep

# conn is your EmrConnection and jobID the id of an already running job flow.
step = StreamingStep(name=jobName,
  mapper='...',   # S3 path (or command) of your mapper script
  reducer='...',  # S3 path (or command) of your reducer script
  # ... any other StreamingStep arguments ...
  cache_archives=["s3://yourBucket/required.tgz#required"],
)
conn.add_jobflow_steps(jobID, [step])

And to allow the imported Python code to work properly in your mapper, make sure to reference it as you would a sub-directory:

import sys

sys.path.append('./required')   # directory unpacked from required.tgz via the #required fragment
import myCustomPythonClass

# Mapper: do something!
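
Put together, a mapper shipped this way could look like the following minimal sketch; process() is a hypothetical helper standing in for whatever myCustomPythonClass actually exposes:

#!/usr/bin/env python
import sys

sys.path.append('./required')    # directory unpacked from the cacheArchive
import myCustomPythonClass

# Hadoop Streaming feeds input records on stdin and expects
# tab-separated key/value pairs on stdout.
for line in sys.stdin:
    key, value = myCustomPythonClass.process(line.strip())  # hypothetical helper
    print('%s\t%s' % (key, value))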
