Locally reading S3 files through Spark (or better: pyspark)


Problem Description

I want to read an S3 file from my (local) machine, through Spark (pyspark, really). Now, I keep getting authentication errors like

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

I looked everywhere here and on the web and tried many things, but apparently S3 support has been changing over the last year or so, and all methods failed but one:

pyspark.SparkContext().textFile("s3n://user:password@bucket/key")

(note the s3n [s3 did not work]). Now, I don't want to use a URL with the user and password because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.

So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now standard ~/.aws/credentials file (ideally, without copying the credentials there to yet another configuration file)?

PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …, but it did not work.

PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting these: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.
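
For later readers, a note on the PPS: the fs.s3n.* keys are Hadoop configuration properties rather than Spark or JVM system properties, which is why the attempts listed above have no effect. Below is a minimal sketch of one commonly used way to set them from pyspark; it relies on the semi-private sc._jsc handle, and the app name and bucket/key are placeholders, not values from the question.

from pyspark import SparkContext

sc = SparkContext("local[*]", "s3-read-sketch")  # placeholder app name

# sc._jsc is the underlying Java SparkContext; its Hadoop configuration is
# where the fs.s3n.* properties actually live.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "<access key id>")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "<secret access key>")

rdd = sc.textFile("s3n://bucket/key")  # placeholder bucket/key
print(rdd.count())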

Solution

The problem was actually a bug in Amazon's boto Python module. It was related to the fact that the MacPorts version is quite old: installing boto through pip solved the problem, and ~/.aws/credentials was then read correctly.

Now that I have more experience, I would say that, in general (as of the end of 2015), the Amazon Web Services tools and Spark/PySpark have patchy documentation and can have some serious bugs that are very easy to run into. For the first problem, I would recommend first updating the aws command line interface, boto and Spark every time something strange happens: this has "magically" solved a few issues for me already.
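
To connect this back to the original question (using the standard ~/.aws/credentials file without copying the keys into yet another configuration file), here is a sketch of one way to wire things up by hand. The credentials file is a plain INI file with a [default] profile, so the standard-library configparser can read it; sc._jsc is a semi-private pyspark handle, and the app name and bucket/key are placeholders.

import os
import configparser

from pyspark import SparkContext

# ~/.aws/credentials is an INI file; the [default] profile holds the keys.
parser = configparser.ConfigParser()
parser.read(os.path.expanduser("~/.aws/credentials"))
access_key = parser["default"]["aws_access_key_id"]
secret_key = parser["default"]["aws_secret_access_key"]

sc = SparkContext("local[*]", "s3-credentials-sketch")  # placeholder app name
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", access_key)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", secret_key)

rdd = sc.textFile("s3n://bucket/key")  # placeholder bucket/key

Whether the s3n scheme is the right one depends on the Hadoop libraries bundled with the Spark build; newer Hadoop versions generally favour s3a, where the same pattern applies with the fs.s3a.access.key and fs.s3a.secret.key properties.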
