How to specify server-side encryption for S3 puts in PySpark?
Question
Thanks to Stack Overflow, I managed to copy hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar from the Maven repo into $SPARK_HOME/jars/ to get s3a:// working for reading from S3 buckets using pyspark (Spark 2.2.0) on my EC2 Linux instance.
df = spark.read.option("header","true").csv("s3a://bucket/csv_file")
But I'm stuck at writing the transformed data back into the S3 bucket with server-side encryption enabled. As expected, the action below throws "Access Denied", since I haven't specified a flag to enable server-side encryption within the pyspark execution environment.
df.write.parquet("s3a://s3_bucket/output.parquet")
To verify, I wrote to a local file and uploaded it to the S3 bucket using --sse, and this works fine:
aws s3 cp local_path s3://s3_bucket/ --sse
How do I enable server-side encryption in pyspark, similar to the above?
Note: I did try adding "fs.s3a.enableServerSideEncryption true" to spark-defaults.conf and passing the same via the --conf parameter of pyspark at startup, but no joy.
Thanks
Recommended answer
This is how I understood it after going through the following Hadoop JIRAs: HADOOP-10675, HADOOP-10400, HADOOP-10568.
Since fs/s3 is part of Hadoop, the setting has to reach the Hadoop configuration, which Spark does for any property prefixed with spark.hadoop. If all S3 bucket puts in your estate are protected by SSE, the following needs to be added to spark-defaults.conf:
spark.hadoop.fs.s3a.server-side-encryption-algorithm AES256
After adding this, I was able to write successfully to the S3 bucket protected by SSE (server-side encryption).
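For anyone who would rather set the property per job than edit spark-defaults.conf, here is a minimal sketch of the same setting applied from Python. The helper names (sse_conf, to_conf_args, build_session) and the app name are made up for illustration and are not part of Spark; the only thing taken from the answer above is the property name and the AES256 value. It assumes the same hadoop-aws and aws-java-sdk jars are already on the classpath.

```python
# Property from the answer above; the "spark.hadoop." prefix makes Spark
# copy it into the Hadoop configuration used by the s3a:// filesystem.
SSE_KEY = "spark.hadoop.fs.s3a.server-side-encryption-algorithm"


def sse_conf(algorithm="AES256"):
    """Return the Spark property asking S3A to request SSE on writes (hypothetical helper)."""
    return {SSE_KEY: algorithm}


def to_conf_args(conf):
    """Render properties as `--conf key=value` arguments for pyspark or spark-submit."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


def build_session(conf, app_name="sse-write-example"):
    """Create a SparkSession carrying the given properties (requires pyspark)."""
    # Imported lazily so the two helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    # Print the command-line form of the setting.
    print(" ".join(to_conf_args(sse_conf())))
```

With a session created as spark = build_session(sse_conf()), the df.write.parquet("s3a://s3_bucket/output.parquet") call from the question should then go out with the SSE setting applied. One caveat: getOrCreate() returns an already running session if there is one (as in the interactive pyspark shell), in which case the property may not take effect, so passing the to_conf_args output at launch is likely the safer route there.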