How do I use an AWS SessionToken to read from S3 in pyspark?
Question
Assume I am doing this:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf() \
.setMaster("local[2]") \
.setAppName("pyspark-unittests") \
.set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)
s3File = sc.textFile("s3a://myrepo/test.csv")
print(s3File.count())
print(s3File.id())
I know that, in theory, I can do this before the 'sc.textFile(...)' call to set my credentials:
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')
However, I don't have a key/secret pair; instead, I have a key/secret/token triplet (they are temporary credentials that are refreshed periodically via AssumeRole; see here for details on getting those credentials: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html)
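For context, here is a hedged sketch of where such a triplet comes from: per the AWS docs linked above, an STS AssumeRole response carries the three values under its 'Credentials' key. The role ARN and session name in the commented usage are placeholders, not values from this question.

```python
def triplet_from_sts_response(response):
    """Map an STS AssumeRole response dict to the three s3a credential properties."""
    creds = response['Credentials']
    return {
        'fs.s3a.access.key': creds['AccessKeyId'],
        'fs.s3a.secret.key': creds['SecretAccessKey'],
        'fs.s3a.session.token': creds['SessionToken'],
    }

# Usage (assumes boto3 is installed and base credentials are configured):
# import boto3
# response = boto3.client('sts').assume_role(
#     RoleArn='arn:aws:iam::123456789012:role/my-role',  # placeholder ARN
#     RoleSessionName='pyspark-session')                 # placeholder name
# triplet_from_sts_response(response)
```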
How can I use the triplet to authenticate to AWS S3, rather than just the key and secret?
My preference would be to use com.amazonaws.auth.profile.ProfileCredentialsProvider as the credentials provider (and have the key/secret/token in ~/.aws/credentials). I would settle for providing them on the command line or hard-coded.
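As a hedged fallback if the profile-provider route cannot be made to work: the triplet can be read out of ~/.aws/credentials directly with the standard library and then pushed into the Hadoop configuration by hand. This assumes the standard profile key names (aws_access_key_id, aws_secret_access_key, aws_session_token).

```python
import configparser

def read_aws_profile(text, profile='default'):
    """Parse INI-style credentials text and return the triplet for a profile."""
    cp = configparser.ConfigParser()
    cp.read_string(text)
    section = cp[profile]
    return (section['aws_access_key_id'],
            section['aws_secret_access_key'],
            section['aws_session_token'])

# In practice the text would come from the real file:
# import os
# with open(os.path.expanduser('~/.aws/credentials')) as f:
#     key, secret, token = read_aws_profile(f.read())
```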
If I try this (with my credentials in ~/.aws/credentials):
sc._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
I still get this:
py4j.protocol.Py4JJavaError: An error occurred while calling o37.partitions.
: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
How can I either load credentials from ~/.aws/credentials or otherwise use a SessionToken?
Answer
I don't see com.amazonaws.auth.profile.ProfileCredentialsProvider in the documentation. There is, however, org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider, which allows you to use the key and secret along with fs.s3a.session.token, which is where your token should go.
The instructions on that page say:

To authenticate with these:
- Declare org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider as the provider.
- Set the session key in the property fs.s3a.session.token, and the access and secret key properties to those of this temporary session.
Example:
<property>
<name>fs.s3a.aws.credentials.provider</name>
<value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value>SESSION-ACCESS-KEY</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>SESSION-SECRET-KEY</value>
</property>
<property>
<name>fs.s3a.session.token</name>
<value>SECRET-SESSION-TOKEN</value>
</property>
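The four XML properties above can also be set from pyspark on the SparkContext's Hadoop configuration, which matches how the question was already setting fs.s3a.access.key. A minimal sketch, where the credential values are placeholders:

```python
def s3a_session_properties(access_key, secret_key, session_token):
    """Return the s3a Hadoop properties for a temporary-credential triplet."""
    return {
        'fs.s3a.aws.credentials.provider':
            'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider',
        'fs.s3a.access.key': access_key,
        'fs.s3a.secret.key': secret_key,
        'fs.s3a.session.token': session_token,
    }

# With a running SparkContext `sc`:
# hadoop_conf = sc._jsc.hadoopConfiguration()
# for name, value in s3a_session_properties(key, secret, token).items():
#     hadoop_conf.set(name, value)
# s3File = sc.textFile("s3a://myrepo/test.csv")
```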