Unable to access S3 data using Spark 2.2


Problem description

I have a lot of data uploaded to an S3 bucket that I want to analyze/visualize using Spark and Zeppelin. Yet, I am still stuck at loading data from S3.

I did some reading in order to get this together and spare me the gory details. I am using the Docker container p7hb/docker-spark as my Spark installation, and my basic test for reading data from S3 is derived from here:

  1. I start the container and the master and slave processes inside it. I can verify that this works by looking at the Spark Master WebUI served on port 8080. This page does list the worker and keeps a log of all my failed attempts under the heading "Completed Applications". All of them end up in state FINISHED.

  2. I open a bash inside that container and do the following:

a) export the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, as suggested here.

b) start spark-shell. In order to access S3, one seems to need to load some extra packages. Browsing through SE I found especially this, which teaches me that I can use the --packages parameter to load said packages. Essentially I run spark-shell --packages com.amazonaws:aws-java-sdk:1.7.15,org.apache.hadoop:hadoop-aws:2.7.5 (for arbitrary combinations of versions).

c) I run the following code:

sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-eu-central-1.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")

val sonnets=sc.textFile("s3a://my-bucket/my.file")

val counts = sonnets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
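
A side note on the snippet above, not part of my original code: textFile, flatMap, map and reduceByKey are all lazy transformations, so Spark only actually talks to S3 once an action runs. A minimal action to force the read could look like this:

// Hypothetical follow-up action: the S3 read (and therefore any S3 error)
// only happens once an action such as this one is executed.
counts.take(10).foreach(println)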

And then I get all kinds of different error messages, depending on the versions I chose in 2b).

I suppose there is nothing wrong with 2a), because I get the error message Unable to load AWS credentials from any provider in the chain if I don't supply the credentials. This is a known error new users seem to make.

While trying to solve the issue, I pick more or less random versions from here and there for the two extra packages. Somewhere on SE I read that hadoop-aws:2.7 is supposed to be the right choice, because Spark 2.2 is based on Hadoop 2.7. Supposedly one needs to use aws-java-sdk:1.7 with that version of hadoop-aws.

Whatever! I tried the following combinations:

  • --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1, which yields the common Bad Request 400 error. Many problems can lead to that error; my attempt as described above contains everything I was able to find on this page. The description above contains s3-eu-central-1.amazonaws.com as endpoint, while other places use s3.eu-central-1.amazonaws.com. According to enter link description here, both endpoint names are supposed to work. I did try both (see the note on V4 signing below).

  • --packages com.amazonaws:aws-java-sdk:1.7.15,org.apache.hadoop:hadoop-aws:2.7.5, which are the most recent micro versions in either case; here I get the error message java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V

  • --packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.7.5, where I also get java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V

  • --packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.8.1, where I get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation

  • --packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.8.3, where I also get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation

  • --packages com.amazonaws:aws-java-sdk:1.8.12,org.apache.hadoop:hadoop-aws:2.8.3, where I also get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation

  • --packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.9.0, where I get java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics

And, for completeness' sake, when I don't provide the --packages parameter at all, I get java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found.
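
One possible factor in the 400 Bad Request attempts, which is an assumption on my part rather than a confirmed fix: eu-central-1 only accepts Signature Version 4 requests, and with the 1.7.x AWS SDK com.amazonaws.services.s3.enableV4 is read as a JVM system property, so setting it through sc.hadoopConfiguration as in the snippet above might simply have no effect. A minimal sketch of setting it as a system property in the driver instead (in local mode the driver is the only JVM involved) could look like this:

// Assumption: with aws-java-sdk 1.7.x, V4 signing is toggled by a JVM system
// property, set in the driver before the first S3 request is made.
System.setProperty("com.amazonaws.services.s3.enableV4", "true")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")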

Currently nothing seems to work. Yet, there are so many Q&As on this topic that who knows what the way du jour of doing this is. This is all in local mode, so there is virtually no other source of error. My method of accessing S3 must be wrong. How is it done correctly?

So I put another day into this, without any actual progress. As far as I can tell, starting from Hadoop 2.6, Hadoop no longer has built-in support for S3; it has to be loaded through additional libraries, which are not part of Hadoop and are managed entirely on their own. Besides all the clutter, the library I ultimately want seems to be hadoop-aws. It has a webpage here, and it carries what I would call authoritative information:

The versions of hadoop-common and hadoop-aws must be identical.

The important thing about this information is that hadoop-common actually does ship with a Hadoop installation. Every Hadoop installation has a corresponding jar file, so this is a solid starting point. My containers have a file /usr/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar, so it is fair to assume 2.7.3 is the version I need for hadoop-aws.
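
For what it's worth, the Hadoop version on Spark's classpath can also be checked directly from spark-shell; this is just a convenience check, nothing the steps above depend on:

// Prints the Hadoop version Spark is running against, e.g. "2.7.3"
println(org.apache.hadoop.util.VersionInfo.getVersion())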

After that it gets murky. Hadoop versions 2.7.x have something going on internally that makes them incompatible with more recent versions of aws-java-sdk, the library required by hadoop-aws. The Internet is full of advice to use version 1.7.4, for example here, but other comments suggest using version 1.7.14 for 2.7.x.

So I did another run using hadoop-aws:2.7.3 and aws-java-sdk:1.7.x, with x ranging from 4 to 14. No results whatsoever; I always end up with error 400, Bad Request.

My Hadoop installation ships joda-time 2.9.4. I read that the problem was resolved with Hadoop 2.8. I suppose I will just go ahead and build my own Docker containers with more recent versions.

Moved to Hadoop 2.8.3. It just works now. Turns out you don't even have to mess around with JARs at all. Hadoop ships with what are supposed to be working JARs for accessing AWS S3. They are hidden in ${HADOOP_HOME}/share/hadoop/tools/lib and not added to the classpath by default. I simply load the JARs in that directory, execute my code as stated above, and now it works.
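
A quick way to confirm from inside spark-shell that the connector JARs really ended up on the classpath, independent of how they were added, is a trivial class lookup:

// Throws ClassNotFoundException if the S3A connector from tools/lib is still not visible
Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")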

Recommended answer

Mixing and matching AWS SDK JARs with anything else is an exercise in futility, as you've discovered. You need the version of the AWS JARs that Hadoop was built with, and the version of Jackson that the AWS SDK was built with. Oh, and don't try mixing any of them (different amazon-* JARs, different hadoop-* JARs, different jackson-* JARs); they all go in lock-sync.

For Spark 2.2.0 and Hadoop 2.7, use the AWS 1.7.4 artifacts, and make sure that, if you are on Java 8, Joda Time is > 2.8.0, such as 2.9.4. An outdated Joda Time on Java 8 is what can lead to the 400 "bad auth problems".

Otherwise, try
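
Independent of that, one way to check from spark-shell which Joda Time actually ended up on the classpath is to read the jar's manifest. This relies on the joda-time jar carrying an Implementation-Version entry, which is an assumption; it may print null otherwise:

// May print null if the joda-time jar's manifest lacks an Implementation-Version entry
println(classOf[org.joda.time.DateTime].getPackage.getImplementationVersion)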
