Running EMR Spark With Multiple S3 Accounts

Question

I have an EMR Spark job that needs to read data from S3 in one account and write to S3 in another account.
I split the job into two steps:

  1. Read the data from S3 (no credentials required because my EMR cluster is in the same account).
  2. Read the data that step 1 wrote to the local HDFS and write it to an S3 bucket in the other account.

I tried setting the hadoopConfiguration:

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<your access key>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","<your secretkey>")

and exporting the keys on the cluster:

$ export AWS_SECRET_ACCESS_KEY=
$ export AWS_ACCESS_KEY_ID=

I've tried both cluster and client mode, as well as spark-shell, with no luck.

Each attempt returns an error:

ERROR ApplicationMaster: User class threw exception: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: 
Access Denied

Recommended Answer

The solution is actually quite simple.

Firstly, EMR clusters have two roles:

  • A service role (EMR_DefaultRole) that grants permissions to the EMR service (e.g., for launching Amazon EC2 instances)
  • An EC2 role (EMR_EC2_DefaultRole) that is attached to the EC2 instances launched in the cluster, giving them access to AWS credentials (see Using an IAM Role to Grant Permissions to Applications Running on Amazon EC2 Instances)

These roles are explained in: Default IAM Roles for Amazon EMR

Therefore, each EC2 instance launched in the cluster is assigned the EMR_EC2_DefaultRole role, which makes temporary credentials available via the Instance Metadata service. (For an explanation of how this works, see: IAM Roles for Amazon EC2.) Amazon EMR nodes use these credentials to access AWS services such as S3, SNS, SQS, CloudWatch and DynamoDB.
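
To make the credential flow above concrete, here is a minimal Python sketch of how a process on a cluster node could read the role's temporary credentials from the Instance Metadata service. It is not part of the original answer. The function names parse_credentials and fetch_role_credentials are hypothetical, and the sketch assumes the IMDSv2 token flow is enabled on the instance. EMR and the AWS SDKs normally do this for you, so it is only meant to show where the credentials come from.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Link-local address of the EC2 Instance Metadata service.
IMDS = "http://169.254.169.254/latest"


def parse_credentials(document: str) -> dict:
    """Extract the temporary credential fields from a metadata response body."""
    data = json.loads(document)
    expiration = datetime.strptime(data["Expiration"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "access_key_id": data["AccessKeyId"],
        "secret_access_key": data["SecretAccessKey"],
        "session_token": data["Token"],
        "expiration": expiration.replace(tzinfo=timezone.utc),
    }


def fetch_role_credentials(role_name: str, timeout: float = 2.0) -> dict:
    """Fetch credentials for role_name; only works on an EC2 instance."""
    # IMDSv2: request a short-lived session token first.
    token_request = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as response:
        token = response.read().decode()

    # Use the session token to read the role's temporary credentials.
    credentials_request = urllib.request.Request(
        f"{IMDS}/meta-data/iam/security-credentials/{role_name}",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(credentials_request, timeout=timeout) as response:
        return parse_credentials(response.read().decode())
```

On a cluster node, fetch_role_credentials("EMR_EC2_DefaultRole") would return an access key, a secret key, a session token and an expiry time. Because these credentials are temporary and rotated automatically, there is normally no need to hard-code keys in the Spark job.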

Secondly, you will need to add permissions to the Amazon S3 bucket in the other account to permit access via the EMR_EC2_DefaultRole role. This can be done by adding a bucket policy to the S3 bucket (here named other-account-bucket) like this:

{
    "Id": "Policy1",
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1",
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::other-account-bucket",
                "arn:aws:s3:::other-account-bucket/*"
            ],
            "Principal": {
                "AWS": [
                    "arn:aws:iam::ACCOUNT-NUMBER:role/EMR_EC2_DefaultRole"
                ]
            }
        }
    ]
}

This policy grants all S3 permissions (s3:*) to the EMR_EC2_DefaultRole role that belongs to the account matching the ACCOUNT-NUMBER in the policy, which should be the account in which the EMR cluster was launched. Be careful when granting such permissions: you might want to grant only specific permissions such as GetObject rather than all S3 permissions.
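
As an illustration of the narrower grant suggested above, the following Python sketch builds a bucket policy with an explicit action list instead of s3:*. It is an assumption-laden example, not the answerer's policy. The function build_bucket_policy and its parameters are hypothetical, 123456789012 is a placeholder account number, and the default actions (s3:GetObject and s3:PutObject on objects, s3:ListBucket on the bucket) are one guess at what a Spark job that writes to the bucket might need. A real job may need more or fewer.

```python
import json


def build_bucket_policy(
    bucket: str,
    account_id: str,
    role_name: str = "EMR_EC2_DefaultRole",
    object_actions: tuple = ("s3:GetObject", "s3:PutObject"),
) -> dict:
    """Build a bucket policy granting a cross-account role limited S3 access."""
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level actions apply to the keys inside the bucket.
                "Sid": "ObjectAccess",
                "Effect": "Allow",
                "Principal": {"AWS": [role_arn]},
                "Action": list(object_actions),
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
            {
                # Listing is a bucket-level action, so it targets the bucket ARN.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Principal": {"AWS": [role_arn]},
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder account number; replace with the account that runs the EMR cluster.
    policy = build_bucket_policy("other-account-bucket", "123456789012")
    print(json.dumps(policy, indent=4))
```

The printed JSON has the same shape as the policy above and could be attached to the bucket by its owner. Splitting the statements keeps object-level and bucket-level actions on the resource types they apply to.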

That's all! The bucket in the other account will now accept requests from the EMR nodes because they are using the EMR_EC2_DefaultRole role.

Disclaimer: I tested the above by creating a bucket in Account-A and assigning permissions (as shown above) to a role in Account-B. An EC2 instance was launched in Account-B with that role. I was able to access the bucket from the EC2 instance via the AWS Command-Line Interface (CLI). I did not test it within EMR; however, it should work the same way.
