aws access s3 from spark using IAM role


Question

I want to access S3 from Spark without configuring any secret and access keys; I want to access it by configuring an IAM role, so I followed the steps given in s3-spark.

But it is still not working from my EC2 instance (which is running standalone Spark).

It works when I test it with the AWS CLI:

[ec2-user@ip-172-31-17-146 bin]$ aws s3 ls s3://testmys3/
2019-01-16 17:32:38        130 e.json

but it did not work when I tried the following:

scala> val df = spark.read.json("s3a://testmys3/*")

I am getting the below error:

19/01/16 18:23:06 WARN FileStreamSink: Error while looking for metadata directory.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: E295957C21AFAC37, AWS Error Code: null, AWS Error Message: Bad Request
  at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
  at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
  at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
  at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:616)

Answer

"400错误的请求"是毫无帮助的,不仅S3提供的内容不多,而且S3A连接器的打印日期也与auth无关.有关对错误进行故障排除的内容很大

"400 Bad Request" is fairly unhelpful, and not only does S3 not provide much, the S3A connector doesn't date print much related to auth either. There's a big section on troubleshooting the error

The fact that it got as far as making a request means that it has some credentials; it's just that the far end doesn't like them.

Possibilities

  • Your IAM role doesn't have permission for s3:ListBucket. See IAM role permissions for working with s3a.
  • Your bucket name is wrong.
  • Some setting in fs.s3a or the AWS_ environment variables takes priority over the IAM role, and it is wrong (a quick check is sketched just below).
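
Not part of the original answer, but a quick way to test the last possibility from spark-shell is to print the relevant S3A keys and any AWS_ environment variables; these are the standard S3A property names, and the expectation is that all of them come back unset when the instance profile is the intended source of credentials:

// run in spark-shell on the EC2 instance
val hc = spark.sparkContext.hadoopConfiguration
Seq("fs.s3a.access.key", "fs.s3a.secret.key", "fs.s3a.session.token",
    "fs.s3a.aws.credentials.provider").foreach(k => println(s"$k = ${hc.get(k)}"))
// anything exported in the shell before launching spark-shell also matters
sys.env.filterKeys(_.startsWith("AWS_")).foreach(println)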

You should automatically have IAM auth as an authentication mechanism with the S3A connector; it's the one which is checked last, after config & env vars.

  1. Have a look at what is set in fs.s3a.aws.credentials.provider: it must be unset or contain the option com.amazonaws.auth.InstanceProfileCredentialsProvider (a sketch of pinning it explicitly follows after the storediag note below).
  2. Assuming you also have hadoop on the command line, grab storediag:

hadoop jar cloudstore-0.1-SNAPSHOT.jar storediag s3a://testmys3/

It should dump what it is up to regarding authentication.
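
A minimal sketch of check 1, assuming spark-shell on the EC2 instance and that no s3a:// path has been touched yet in the session (otherwise the cached FileSystem keeps its old settings); the class name is the stock AWS SDK provider referenced above:

// pin the S3A connector to the EC2 instance profile credentials only
spark.sparkContext.hadoopConfiguration.set(
  "fs.s3a.aws.credentials.provider",
  "com.amazonaws.auth.InstanceProfileCredentialsProvider")
val df = spark.read.json("s3a://testmys3/*")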

Update

As the original poster commented, it was due to v4 authentication being required on the specific S3 endpoint. This can be enabled on the 2.7.x version of the s3a client, but only via Java system properties. For 2.8+ there are some fs.s3a. options you can set instead.
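
As an illustration only (the region shown is an assumption; use the bucket's actual one): on Hadoop 2.7.x the SDK system property has to reach both the driver and the executors at launch time, while on 2.8+ pointing fs.s3a.endpoint at the bucket's regional endpoint is usually enough:

// Hadoop 2.7.x: pass the AWS SDK property at launch, e.g.
//   spark-shell --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true \
//               --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
// Hadoop 2.8+: set the regional endpoint (eu-central-1 is just an example) before reading
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
val df = spark.read.json("s3a://testmys3/*")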
