Locally reading S3 files through Spark (or better: pyspark)
Question
I want to read an S3 file from my (local) machine, through Spark (pyspark, really). Now, I keep getting authentication errors like
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
I looked everywhere here and on the web, tried many things, but apparently S3 has been changing over the last year or months, and all methods failed but one:
pyspark.SparkContext().textFile("s3n://user:password@bucket/key")
(note the s3n [s3 did not work]). Now, I don't want to use a URL with the user and password because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.
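For what it's worth, when credentials are embedded in the URL like this, a secret key containing "/" also breaks the URL parsing; percent-encoding works around that. A minimal Python 3 sketch with placeholder keys (the bucket and key names are made up):

```python
# Percent-encode the credentials before embedding them in the s3n:// URL;
# a "/" in the secret key would otherwise break the URL parsing.
# The key values below are placeholders, not real credentials.
from urllib.parse import quote

access_key = "AKIAEXAMPLEKEY"        # placeholder access key id
secret_key = "abc/defEXAMPLESECRET"  # placeholder secret containing a "/"

url = "s3n://{}:{}@bucket/key".format(
    quote(access_key, safe=""),
    quote(secret_key, safe=""),
)
# The "/" in the secret is now encoded as %2F.
```

Even so, the logging concern above still applies, so this is only a stopgap.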
So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now-standard ~/.aws/credentials file (ideally, without copying the credentials there to yet another configuration file)?
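One way to avoid copying credentials into a second file is to read the standard credentials file directly; its INI format is handled by Python's stdlib configparser. A sketch, where the "default" profile name and the helper itself are my own assumptions, not something from the question:

```python
# Read the access key pair straight from the standard AWS credentials file,
# so nothing has to be copied into another configuration file.
import configparser
import os

def load_aws_credentials(path="~/.aws/credentials", profile="default"):
    """Return (access_key_id, secret_access_key) from the given profile."""
    config = configparser.ConfigParser()
    config.read(os.path.expanduser(path))
    section = config[profile]
    return section["aws_access_key_id"], section["aws_secret_access_key"]
```

The returned pair can then be fed to whichever Spark configuration route ends up working, instead of being hard-coded.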
PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …; it did not work.
PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting these: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.
Answer
The problem was actually a bug in Amazon's boto Python module. It came down to the MacPorts version being old: installing boto through pip solved the problem, and ~/.aws/credentials was then read correctly.
Now that I have more experience, I would say that in general (as of the end of 2015) Amazon Web Services tools and Spark/PySpark have patchy documentation and some serious bugs that are very easy to run into. For the first problem, I would recommend first updating the AWS command-line interface, boto, and Spark whenever something strange happens: this has "magically" solved a few issues for me already.