Locally reading S3 files through Spark (or better: pyspark)
Question
I want to read an S3 file from my (local) machine, through Spark (pyspark, really). Now, I keep getting authentication errors like
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
I looked everywhere here and on the web, tried many things, but apparently S3 has been changing over the last year or months, and all methods failed but one:
pyspark.SparkContext().textFile("s3n://user:password@bucket/key")

(note the s3n [s3 did not work]). Now, I don't want to use a URL with the user and password because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.
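For what it's worth, embedding credentials in the URL has an extra pitfall: AWS secret keys can contain "/" characters, which must be percent-encoded before being placed in an s3n:// URL. A minimal sketch (the bucket, key, and credential values are placeholders):

```python
from urllib.parse import quote  # Python 3; urllib.quote on Python 2

# Placeholder credentials -- real AWS secret keys may contain "/" or "+",
# which would break URL parsing unless percent-encoded first.
access_key = "AKIAEXAMPLE"
secret_key = "abc/def+ghi"

url = "s3n://{}:{}@bucket/key".format(access_key, quote(secret_key, safe=""))
print(url)  # s3n://AKIAEXAMPLE:abc%2Fdef%2Bghi@bucket/key
```

This only addresses the encoding issue, not the logging concern below.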
So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now standard ~/.aws/credentials file (ideally, without copying the credentials there to yet another configuration file)?
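As an aside, the ~/.aws/credentials file is plain INI, so the key pair can at least be extracted with the standard library. A sketch, assuming the default AWS file location and profile name; mapping the values onto the fs.s3n.* Hadoop properties is my inference from the exception message, not a confirmed fix:

```python
import configparser
import os

def read_aws_credentials(path=os.path.expanduser("~/.aws/credentials"),
                         profile="default"):
    """Read the access key pair for one profile from the standard INI file."""
    config = configparser.ConfigParser()
    config.read(path)
    section = config[profile]
    return section["aws_access_key_id"], section["aws_secret_access_key"]

# The two Hadoop properties named in the exception, filled from the file:
# access_key, secret_key = read_aws_credentials()
# s3_props = {"fs.s3n.awsAccessKeyId": access_key,
#             "fs.s3n.awsSecretAccessKey": secret_key}
```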
PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …; it did not work.
PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting these: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.
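For reference, all three mechanisms target the same two property names quoted in the exception message; a sketch of the keys and attempted call patterns (the credential values are placeholders, and none of these is confirmed working here):

```python
# The two Hadoop property names quoted in the exception message:
s3_props = {
    "fs.s3n.awsAccessKeyId": "AKIAEXAMPLE",         # placeholder
    "fs.s3n.awsSecretAccessKey": "example-secret",  # placeholder
}

# The three call patterns attempted (sketched, none confirmed working):
#   SparkContext.setSystemProperty(key, value)   # before creating the context
#   sc.setLocalProperty(key, value)              # on a live context
#   conf = SparkConf()
#   for key, value in s3_props.items():
#       conf.set(key, value)
#   sc = SparkContext(conf=conf)
print(sorted(s3_props))
```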
Answer

The problem was actually a bug in Amazon's boto Python module, related to the fact that the MacPorts version is quite old: installing boto through pip solved the problem, and ~/.aws/credentials was then read correctly.
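A quick way to check which copy of boto Python would actually import (and so spot a stale MacPorts install shadowing the pip one) is the standard library's module finder; a small sketch:

```python
import importlib.util

# find_spec locates a module without importing it; spec.origin is the file
# path Python would load, which reveals a MacPorts vs. pip install location.
spec = importlib.util.find_spec("boto")
if spec is None:
    print("boto is not installed")
else:
    print("boto would be imported from:", spec.origin)
```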
Now that I have more experience, I would say that in general (as of the end of 2015) the Amazon Web Services tools and Spark/PySpark have patchy documentation and can have some serious bugs that are very easy to run into. For problems of the first kind, I would recommend first updating the aws command-line interface, boto, and Spark whenever something strange happens: this has already "magically" solved a few issues for me.