Hadoop 2.9.2, Spark 2.4.0 access AWS s3a bucket
Question
It's been a couple of days, but I still cannot download from a public Amazon bucket using Spark :(
Here is the spark-shell command:
spark-shell --master yarn
-v
--jars file:/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar,file:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar
--driver-class-path=/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar
The application starts and the shell waits at the prompt:
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.0
/_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val data1 = sc.textFile("s3a://my-bucket-name/README.md")
18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 242.1 KB, free 246.7 MB)
18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.2 KB, free 246.6 MB)
18/12/25 13:06:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop-edge01:3545 (size: 24.2 KB, free: 246.9 MB)
18/12/25 13:06:40 INFO SparkContext: Created broadcast 0 from textFile at <console>:24
data1: org.apache.spark.rdd.RDD[String] = s3a://my-bucket-name/README.md MapPartitionsRDD[1] at textFile at <console>:24
scala> data1.count()
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:97)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD.count(RDD.scala:1168)
... 49 elided
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.fs.StorageStatistics
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 77 more
scala>
- All AWS keys and secret keys were set in hadoop/core-site.xml as described here: Hadoop-AWS module: Integration with Amazon Web Services
- The bucket is public - anyone can download from it (tested with curl -O)
- All the .jars, as you can see, were provided by Hadoop itself from the /usr/local/hadoop/share/hadoop/tools/lib/ folder
- There are no additional settings in spark-defaults.conf - only what was passed on the command line

Neither jar provides the missing class (a check of where that class actually lives is sketched right after these listings):
jar tf /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar | grep org/apache/hadoop/fs/StorageStatistics
(no result)
jar tf /usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar | grep org/apache/hadoop/fs/StorageStatistics
(no result)
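For what it's worth, org/apache/hadoop/fs/StorageStatistics is not expected in either of those jars: the class lives in hadoop-common (it appeared in Hadoop 2.8). A minimal check, assuming the standard layout of a Hadoop 2.9.2 install under /usr/local/hadoop, would be:

jar tf /usr/local/hadoop/share/hadoop/common/hadoop-common-2.9.2.jar | grep org/apache/hadoop/fs/StorageStatistics

If that lists the class, hadoop-aws-2.9.2 needs a hadoop-common of at least 2.8 somewhere on the classpath.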
What should I do? Did I forget to add another jar? What is the exact configuration for hadoop-aws and aws-java-sdk-bundle? Which versions?
Answer
Mmmm... I finally found the problem.
The main issue is that the Spark I have was pre-installed for Hadoop: it's "v2.4.0 pre-built for Hadoop 2.7 and later". That title is a bit misleading, as you can see from my struggles above. In fact, Spark ships with its own set of Hadoop jars. The listing of /usr/local/spark/jars/ shows that it has:
hadoop-common-2.7.3.jar
hadoop-client-2.7.3.jar
....
It is only missing hadoop-aws and aws-java-sdk. I did a bit of digging in the Maven repository: hadoop-aws-2.7.3 and its dependency aws-java-sdk-1.7.4, and voila! I downloaded those jars and passed them as parameters to Spark, like this:
spark-shell
--master yarn
-v
--jars file:/home/aws-java-sdk-1.7.4.jar,file:/home/hadoop-aws-2.7.3.jar
--driver-class-path=/home/aws-java-sdk-1.7.4.jar:/home/hadoop-aws-2.7.3.jar
Done, it works!!!
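As an aside (untested here), spark-shell's --packages option should be able to fetch the same jars straight from Maven instead of downloading them by hand. The coordinates below simply mirror the versions found above; --packages resolves transitive dependencies, so aws-java-sdk 1.7.4 should come along automatically:

spark-shell
--master yarn
--packages org.apache.hadoop:hadoop-aws:2.7.3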
I'm just wondering why all the jars from Hadoop (and I passed all of them to both --jars and --driver-class-path) didn't get picked up. Spark somehow automatically chooses its own jars instead of what I send.
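A quick way to confirm which Hadoop version Spark actually resolves on its classpath (a sanity check using the stock Hadoop API, nothing specific to this setup) is:

scala> org.apache.hadoop.util.VersionInfo.getVersion()

which, if the bundled jars in /usr/local/spark/jars/ win, should report 2.7.3 rather than 2.9.2.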