Spark Scala S3 storage: permission denied
Problem description
I've read a lot of topics on the Internet on how to get Spark working with S3, but still nothing works properly. I've downloaded Spark 2.3.2 with Hadoop 2.7 and above.
I've copied only some libraries from Hadoop 2.7.7 (which matches the Spark/Hadoop version) into the Spark jars folder:
- hadoop-aws-2.7.7.jar
- hadoop-auth-2.7.7.jar
- aws-java-sdk-1.7.4.jar
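As an alternative to copying jars by hand, the same dependencies can be pulled at launch time with `--packages` (a sketch using the versions from the jar list above; adjust them to match your own Hadoop build):

```
spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.7,com.amazonaws:aws-java-sdk:1.7.4
```

This lets Spark resolve the artifacts from Maven Central instead of relying on manually placed files in the jars folder.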
Still, I can't use either S3N or S3A to get my file read by Spark.
For S3A I get this exception:
sc.hadoopConfiguration.set("fs.s3a.access.key","myaccesskey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecretkey")
val file = sc.textFile("s3a://my.domain:8080/test_bucket/test_file.txt")
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: AE203E7293ZZA3ED, AWS Error Code: null, AWS Error Message: Forbidden
Using this piece of Python, plus some more code, I can list my buckets, list my files, download files, read files from my computer, and get a file URL. This code gives me the following file URL:
What should I install / set up / download to get Spark able to read and write from my S3 server?
Edit 3:
Using the debug tool from the comments, here's the result. It seems the issue is with the signature; I'm not sure what that means.
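For context on what "signature" means here: `S3SignerType` (set below in the answer) selects S3's legacy Signature Version 2 scheme, where the client HMAC-SHA1-signs a canonical string and sends it in the `Authorization` header; a 403 Forbidden typically means the server computed a different signature. A minimal sketch of that scheme (illustrative only, with placeholder credentials; real clients must also canonicalize `x-amz-*` headers):

```python
import base64
import hashlib
import hmac

def sigv2_authorization(access_key, secret_key, verb, resource,
                        date, content_md5="", content_type=""):
    # Legacy S3 "SigV2" string-to-sign: verb, MD5, content type,
    # date, and the canonicalized resource path, joined by newlines.
    string_to_sign = "\n".join([verb, content_md5, content_type, date, resource])
    # Signature is Base64(HMAC-SHA1(secret_key, string_to_sign)).
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    return f"AWS {access_key}:{signature}"

# Placeholder values, mirroring the question's bucket and file:
header = sigv2_authorization(
    "myaccesskey", "mysecretkey", "GET",
    "/test_bucket/test_file.txt", "Tue, 27 Mar 2007 19:36:42 +0000")
print(header)
```

If client and server disagree on any component of the string-to-sign (for example the resource path under path-style vs. virtual-host addressing), the signatures diverge and the request is rejected with 403.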
Recommended answer
First you will need to download the hadoop-aws.jar and aws-java-sdk.jar versions that match your Spark/Hadoop release and add them to the jars folder inside the Spark folder.
Then you will need to specify the server you will use and enable path-style access if your S3 server does not support dynamic DNS:
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
// I had to change the signature version because I have an old S3 API implementation:
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
Here is my final code:
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key","mykey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecret")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","true")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
val tmp = sc.textFile("s3a://test_bucket/test_file.txt")
tmp.count()
I would recommend putting most of the settings inside spark-defaults.conf:
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.endpoint mydomain:8080
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.signing-algorithm S3SignerType
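The credentials themselves can also go through the same `spark.hadoop.*` passthrough if you accept having them on disk (a config fragment with placeholder values; a credentials provider or environment variables avoid writing secrets to the file):

```
spark.hadoop.fs.s3a.access.key mykey
spark.hadoop.fs.s3a.secret.key mysecret
```

With these in spark-defaults.conf, the Scala code shrinks to just the `textFile` call.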
One of the issues I had was setting spark.hadoop.fs.s3a.connection.timeout to 10: this value is interpreted in milliseconds prior to Hadoop 3, which gave me a very long timeout; the error message would appear 1.5 minutes after the attempt to read a file.
PS:
Special thanks to Steve Loughran.
Thanks a lot for the precious help.