Unable to read from s3 bucket using spark

Question
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("try1")
  .master("local")
  .getOrCreate()

import spark.implicits._ // needed for the $"column" syntax

val df = spark.read
  .json("s3n://BUCKET-NAME/FOLDER/FILE.json")
  .select($"uid")

df.show(5)
I have set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables. When trying to read from S3, I get the following error:
Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/FOLDER%2FFILE.json' - ResponseCode=400, ResponseMessage=Bad Request
I suspect the error is caused by "/" being converted to "%2F" by some internal function, since the error shows '/FOLDER%2FFILE.json' instead of '/FOLDER/FILE.json'.
Answer

Your Spark (JVM) application cannot read environment variables unless you pass them along explicitly, so a quick workaround is to set the credentials in the Hadoop configuration:
spark.sparkContext
  .hadoopConfiguration.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
spark.sparkContext
  .hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
You will also need to specify the S3 endpoint:
spark.sparkContext
  .hadoopConfiguration.set("fs.s3a.endpoint", "<<ENDPOINT>>")
To learn more about what an AWS S3 endpoint is, refer to the AWS documentation on regions and endpoints.
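Putting the pieces together, a minimal end-to-end sketch might look like the following. Note the assumptions: it uses the newer s3a connector throughout (the answer above mixes s3n credential keys with an s3a endpoint key; the s3a equivalents are `fs.s3a.access.key` and `fs.s3a.secret.key`), reads the credentials from the same environment variables the question mentions, and keeps `<<ENDPOINT>>` and `BUCKET-NAME` as placeholders for your own values:

```scala
import org.apache.spark.sql.SparkSession

object S3ReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("try1")
      .master("local")
      .getOrCreate()

    // Forward the credentials from the environment into the Hadoop
    // configuration; the s3a connector uses different key names than s3n.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
    // Placeholder endpoint -- use the one matching your bucket's region,
    // e.g. "s3.eu-central-1.amazonaws.com".
    hadoopConf.set("fs.s3a.endpoint", "<<ENDPOINT>>")

    import spark.implicits._ // for the $"column" syntax

    spark.read
      .json("s3a://BUCKET-NAME/FOLDER/FILE.json")
      .select($"uid")
      .show(5)

    spark.stop()
  }
}
```

Using s3a rather than s3n also sidesteps the jets3t-based client that produced the 400 Bad Request in the question, which is why the sketch switches schemes in the read path as well.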