How does EMR handle an S3 bucket for input and output?

Question

I'm spinning up an EMR cluster and have created the buckets specified in the EMR docs, but how should I upload data and read from it? In my spark-submit step I give the script name as s3://myclusterbucket/scripts/script.py. Is output not automatically uploaded to S3? How are dependencies handled? I've tried using the --py-files option pointing to a dependency zip inside the S3 bucket, but I keep getting 'file not found'.

Recommended answer

MapReduce or Tez jobs in EMR can access S3 directly because of EMRFS (an AWS proprietary Hadoop filesystem implementation based on S3). For example, in Apache Pig you can do loaded_data = LOAD 's3://mybucket/myfile.txt' USING PigStorage();

I'm not sure about Python-based Spark jobs, but one solution is to first copy the objects from S3 to the EMR HDFS and then process them there.

There are multiple ways of doing the copy:

  • Use hadoop fs commands to copy objects from S3 to the EMR HDFS (and vice versa), e.g., hadoop fs -cp s3://mybucket/myobject hdfs:///mypath_on_emr_hdfs

  • Use s3-dist-cp to copy objects from S3 to the EMR HDFS (and vice versa): http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
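
The two copy methods listed above could be driven from a small Python script, for instance one run on the master node before the processing step. This is only a sketch: it assumes the hadoop and s3-dist-cp commands are on the PATH (as on an EMR node), the bucket and HDFS paths are placeholders, and the helper names are made up for illustration. By default it only prints the commands, so it can be tried off-cluster.

```python
import shlex
import subprocess


def hadoop_cp_command(src, dest):
    """Build a `hadoop fs -cp` command; works in either direction."""
    return ["hadoop", "fs", "-cp", src, dest]


def s3_dist_cp_command(src, dest):
    """Build an `s3-dist-cp` command, which runs the copy as a distributed job."""
    return ["s3-dist-cp", "--src", src, "--dest", dest]


def run(command, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(command))
    if not dry_run:
        # check=True raises CalledProcessError if the copy fails.
        subprocess.run(command, check=True)


# Placeholder paths; replace with real ones.
to_hdfs = hadoop_cp_command("s3://mybucket/myobject", "hdfs:///mypath_on_emr_hdfs")
# Hypothetical output location, copied back to S3 after processing.
to_s3 = s3_dist_cp_command("hdfs:///myoutput", "s3://mybucket/output")

run(to_hdfs)
run(to_s3)
```

On the cluster, calling run(to_hdfs, dry_run=False) would perform the copy for real.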

You can also use awscli (or hadoop fs -copyToLocal) to copy objects from S3 to the local disk of the EMR master instance (and vice versa), e.g., aws s3 cp s3://mybucket/myobject .
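
The awscli copy works in both directions, which also covers getting output back into the bucket by hand. A minimal sketch that only builds and prints the two commands; the local file name and the output key are hypothetical, and the helper name is illustrative.

```python
import shlex


def aws_s3_cp_command(src, dest):
    """Build an `aws s3 cp` command; either side may be an s3:// URI or a local path."""
    return ["aws", "s3", "cp", src, dest]


# Download an object into the current directory.
download = aws_s3_cp_command("s3://mybucket/myobject", ".")
# The reverse direction: upload a local file (hypothetical name) to the bucket.
upload = aws_s3_cp_command("results.csv", "s3://mybucket/output/results.csv")

for command in (download, upload):
    print(shlex.join(command))
```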
