Locally reading S3 files through Spark (or better: pyspark)

Problem description

I want to read an S3 file from my (local) machine, through Spark (pyspark, really). Now, I keep getting authentication errors like

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

I looked everywhere here and on the web and tried many things, but apparently S3 has been changing over the past year or so, and all methods failed except one:

pyspark.SparkContext().textFile("s3n://user:password@bucket/key")

(note the s3n [s3 did not work]). Now, I don't want to use a URL with the user and password because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.

So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now standard ~/.aws/credentials file (ideally, without copying the credentials there to yet another configuration file)?

PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …; that did not work either.

PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting these: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.

Answer

The problem was actually a bug in Amazon's boto Python module. It came down to the fact that the MacPorts version is quite old: installing boto through pip solved the problem, and ~/.aws/credentials was then read correctly.
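
To keep the credentials out of the URL (and avoid copying them into yet another configuration file), one option is to parse ~/.aws/credentials directly, since it is a plain INI file, and hand the keys to the Hadoop configuration attached to the SparkContext. This is only a minimal sketch, assuming the default profile; the bucket and key in the s3n URL are placeholders:

import os
try:
    from configparser import ConfigParser   # Python 3
except ImportError:
    from ConfigParser import ConfigParser   # Python 2
from pyspark import SparkContext

# Read the default profile from the standard AWS credentials file.
config = ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
access_key = config.get("default", "aws_access_key_id")
secret_key = config.get("default", "aws_secret_access_key")

sc = SparkContext("local", "s3-credentials-test")

# fs.s3n.* are Hadoop properties, so they go on the Hadoop configuration
# reachable through the py4j-wrapped Java SparkContext, not on SparkConf.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", access_key)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", secret_key)

lines = sc.textFile("s3n://bucket/key")  # placeholder bucket/key, no credentials in the URL
print(lines.take(5))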

Now that I have more experience, I would say that, in general (as of late 2015), Amazon Web Services tools and Spark/PySpark have patchy documentation and some serious bugs that are very easy to run into. As a first step, I would recommend updating the AWS command-line interface, boto, and Spark whenever something strange happens: this has "magically" solved a few issues for me already.
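
As a quick sanity check that the pip-installed boto really picks up ~/.aws/credentials, one can open an S3 connection without passing any keys explicitly; a minimal sketch that just lists the account's buckets:

import boto

# boto 2.x falls back to ~/.aws/credentials when no keys are given explicitly,
# so an authentication error here means the file is still not being read.
conn = boto.connect_s3()
print([bucket.name for bucket in conn.get_all_buckets()])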
