How do I use an AWS SessionToken to read from S3 in pyspark?


Question

Suppose I'm doing this:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf() \
        .setMaster("local[2]") \
        .setAppName("pyspark-unittests") \
        .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)
s3File = sc.textFile("s3a://myrepo/test.csv")
print(s3File.count())
print(s3File.id())

I know that, in theory, I can do this before the 'sc.textFile(...)' call to set my credentials:

sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')

However, I don't have a key/secret pair; instead, I have a key/secret/token triplet (they are temporary credentials that are refreshed periodically via AssumeRole... see here for details on getting those credentials: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html).
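For context, here's a minimal sketch of obtaining such a key/secret/token triplet with boto3's STS client (the role ARN and session name below are hypothetical placeholders):

import boto3

# Assume a role via STS; the returned credentials are temporary.
sts = boto3.client('sts')
response = sts.assume_role(
    RoleArn='arn:aws:iam::123456789012:role/my-spark-role',  # placeholder ARN
    RoleSessionName='pyspark-s3-read'
)
creds = response['Credentials']
access_key = creds['AccessKeyId']
secret_key = creds['SecretAccessKey']
session_token = creds['SessionToken']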

How can I use the triplet to authenticate to AWS S3, rather than just the key and secret?

My preference would be to use com.amazonaws.auth.profile.ProfileCredentialsProvider as the credentials provider (and have the key/secret/token in ~/.aws/credentials). I would settle for providing them on the command line or hard-coded.
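For reference, a profile holding temporary credentials in ~/.aws/credentials carries a third entry, aws_session_token, alongside the usual key and secret (the values here are placeholders):

[default]
aws_access_key_id = <SESSION-ACCESS-KEY>
aws_secret_access_key = <SESSION-SECRET-KEY>
aws_session_token = <SESSION-TOKEN>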

If I try this (with my credentials in ~/.aws/credentials):

sc._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")

I still get this:

py4j.protocol.Py4JJavaError: An error occurred while calling o37.partitions.
: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain

如何从〜/.aws/credentials加载凭据,或者如何使用SessionToken?

How can I either load credentials from ~/.aws/credentials or otherwise use a SessionToken?

Answer

I don't see com.amazonaws.auth.profile.ProfileCredentialsProvider in the documentation. There is, however, org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider, which allows you to use the key and secret along with fs.s3a.session.token, which is where the token should go.

The instructions on that page say:

To authenticate with these:

  1. Declare org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider as the provider.
  2. Set the session key in the property fs.s3a.session.token, and the access and secret key properties to those of this temporary session.

Example:

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>

<property>
  <name>fs.s3a.access.key</name>
  <value>SESSION-ACCESS-KEY</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>SESSION-SECRET-KEY</value>
</property>

<property>
  <name>fs.s3a.session.token</name>
  <value>SECRET-SESSION-TOKEN</value>
</property>
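Applied back to the pyspark snippet from the question, the same four properties can be set on the Hadoop configuration before the first S3 read. This is a sketch that assumes the session credentials are already held in the access_key, secret_key, and session_token variables (e.g. from the STS call above):

# Use the temporary-credentials provider and pass the session triplet.
sc._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider',
                                  'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', access_key)
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', secret_key)
sc._jsc.hadoopConfiguration().set('fs.s3a.session.token', session_token)

s3File = sc.textFile('s3a://myrepo/test.csv')
print(s3File.count())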
