Locally reading S3 files through Spark (or better: pyspark)
Question
I want to read an S3 file from my (local) machine, through Spark (pyspark, really). Now, I keep getting authentication errors like
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
I looked everywhere here and on the web, tried many things, but apparently S3 has been changing over the last year or months, and all methods failed but one:
pyspark.SparkContext().textFile("s3n://user:password@bucket/key")

(note the s3n [s3 did not work]). Now, I don't want to use a URL with the user and password because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.
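For what it's worth, embedding credentials in the URL has an extra pitfall: AWS secret keys can contain "/" characters, which must be percent-encoded before being placed in an s3n:// URL. A minimal sketch (the bucket, key, and credential values are placeholders):

```python
from urllib.parse import quote  # Python 3; urllib.quote on Python 2

# Placeholder credentials -- real AWS secret keys may contain "/" or "+",
# which would break URL parsing unless percent-encoded first.
access_key = "AKIAEXAMPLE"
secret_key = "abc/def+ghi"

url = "s3n://{}:{}@bucket/key".format(access_key, quote(secret_key, safe=""))
print(url)  # s3n://AKIAEXAMPLE:abc%2Fdef%2Bghi@bucket/key
```

This only addresses the encoding issue, not the logging concern below.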
So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now standard ~/.aws/credentials file (ideally, without copying the credentials there to yet another configuration file)?
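As an aside, the ~/.aws/credentials file is plain INI, so the key pair can at least be extracted with the standard library. A sketch, assuming the default AWS file location and profile name; mapping the values onto the fs.s3n.* Hadoop properties is my inference from the exception message, not a confirmed fix:

```python
import configparser
import os

def read_aws_credentials(path=os.path.expanduser("~/.aws/credentials"),
                         profile="default"):
    """Read the access key pair for one profile from the standard INI file."""
    config = configparser.ConfigParser()
    config.read(path)
    section = config[profile]
    return section["aws_access_key_id"], section["aws_secret_access_key"]

# The two Hadoop properties named in the exception, filled from the file:
# access_key, secret_key = read_aws_credentials()
# s3_props = {"fs.s3n.awsAccessKeyId": access_key,
#             "fs.s3n.awsSecretAccessKey": secret_key}
```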
PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …; it did not work.
PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting these: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.
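For reference, all three mechanisms target the same two property names quoted in the exception message; a sketch of the keys and attempted call patterns (the credential values are placeholders, and none of these is confirmed working here):

```python
# The two Hadoop property names quoted in the exception message:
s3_props = {
    "fs.s3n.awsAccessKeyId": "AKIAEXAMPLE",         # placeholder
    "fs.s3n.awsSecretAccessKey": "example-secret",  # placeholder
}

# The three call patterns attempted (sketched, none confirmed working):
#   SparkContext.setSystemProperty(key, value)   # before creating the context
#   sc.setLocalProperty(key, value)              # on a live context
#   conf = SparkConf()
#   for key, value in s3_props.items():
#       conf.set(key, value)
#   sc = SparkContext(conf=conf)
print(sorted(s3_props))
```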
Answer

The problem was actually a bug in Amazon's boto Python module, related to the fact that the MacPorts version is quite old: installing boto through pip solved the problem, and ~/.aws/credentials was then read correctly.
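A quick way to check which copy of boto Python would actually import (and so spot a stale MacPorts install shadowing the pip one) is the standard library's module finder; a small sketch:

```python
import importlib.util

# find_spec locates a module without importing it; spec.origin is the file
# path Python would load, which reveals a MacPorts vs. pip install location.
spec = importlib.util.find_spec("boto")
if spec is None:
    print("boto is not installed")
else:
    print("boto would be imported from:", spec.origin)
```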
Now that I have more experience, I would say that in general (as of the end of 2015) the Amazon Web Services tools and Spark/PySpark have patchy documentation and can have some serious bugs that are very easy to run into. For problems of the first kind, I would recommend first updating the aws command-line interface, boto, and Spark whenever something strange happens: this has already "magically" solved a few issues for me.