Hadoop 2.9.2, Spark 2.4.0 access AWS s3a bucket

Question

I have been trying for a couple of days, but I cannot download from a public Amazon bucket using Spark :(

Here is the spark-shell command:

spark-shell --master yarn \
              -v \
              --jars file:/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar,file:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar \
              --driver-class-path=/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar

The application starts and the shell waits at the prompt:

   ____              __
  / __/__  ___ _____/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 2.4.0
   /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val data1 = sc.textFile("s3a://my-bucket-name/README.md")

18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 242.1 KB, free 246.7 MB)
18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.2 KB, free 246.6 MB)
18/12/25 13:06:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop-edge01:3545 (size: 24.2 KB, free: 246.9 MB)
18/12/25 13:06:40 INFO SparkContext: Created broadcast 0 from textFile at <console>:24
data1: org.apache.spark.rdd.RDD[String] = s3a://my-bucket-name/README.md MapPartitionsRDD[1] at textFile at <console>:24

scala> data1.count()

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:97)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD.count(RDD.scala:1168)
... 49 elided
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.StorageStatistics
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 77 more

scala>

  1. All AWS access keys and secret keys are set in hadoop/core-site.xml as described here: Hadoop-AWS module: Integration with Amazon Web Services
  2. The bucket is public: anyone can download from it (tested with curl -O)
  3. All the .jar files, as you can see, are provided by Hadoop itself from the /usr/local/hadoop/share/hadoop/tools/lib/ folder
  4. There are no additional settings in spark-defaults.conf, only what was passed on the command line
  5. Neither jar provides this class:

jar tf /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar | grep org/apache/hadoop/fs/StorageStatistics
(no result)

jar tf /usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar | grep org/apache/hadoop/fs/StorageStatistics
(no result)
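
The two checks above can be generalized into a small helper that scans every jar in a directory for a given class. This is only a sketch: the function names class_to_entry and find_class_in_jars are made up for illustration, and it assumes unzip is installed (jar tf would serve equally well).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of Hadoop or Spark.

# Convert a fully qualified class name into the entry path used inside a jar,
# e.g. a.b.C -> a/b/C.class
class_to_entry() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print every jar in directory $1 that contains class $2.
# Assumes `unzip` is available. The body runs in a subshell so that the
# helper variables do not leak into the caller's shell.
find_class_in_jars() (
  entry="$(class_to_entry "$2")"
  for jar in "$1"/*.jar; do
    [ -e "$jar" ] || continue   # the glob matched nothing
    if unzip -l "$jar" 2>/dev/null | grep -qF "$entry"; then
      printf '%s\n' "$jar"
    fi
  done
)

# Example, using the directory from the question:
# find_class_in_jars /usr/local/hadoop/share/hadoop/tools/lib org.apache.hadoop.fs.StorageStatistics
```

Running it against each jar directory in turn (Hadoop's and Spark's) would show which jar, if any, actually ships the missing class.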

What should I do? Did I forget to add another jar? What is the exact configuration of hadoop-aws and aws-java-sdk-bundle? Which versions?

Recommended answer

Mmmm... I finally found the problem.

The main issue is that my Spark is a build that comes pre-packaged for Hadoop: 'v2.4.0 pre-built for Hadoop 2.7 and later'. That is a somewhat misleading title, as my struggles above show. Spark actually ships with its own, different version of the Hadoop jars. The listing of /usr/local/spark/jars/ shows that it has:

hadoop-common-2.7.3.jar
hadoop-client-2.7.3.jar
....
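
Because hadoop-aws generally has to match the hadoop-common version that is actually on the classpath, the version to look for can be read straight from these jar names. A minimal sketch, assuming the Spark jars directory holds a hadoop-common-<version>.jar as in the listing above; the function name matching_hadoop_aws is hypothetical, while org.apache.hadoop:hadoop-aws is the artifact's Maven coordinate.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Maven coordinate of the hadoop-aws release
# whose version equals the hadoop-common jar found in directory $1.
matching_hadoop_aws() (
  for jar in "$1"/hadoop-common-*.jar; do
    [ -e "$jar" ] || continue          # the glob matched nothing
    name="${jar##*/}"                  # strip the directory
    version="${name#hadoop-common-}"   # strip the artifact name
    version="${version%.jar}"          # strip the extension
    printf 'org.apache.hadoop:hadoop-aws:%s\n' "$version"
  done
)

# Example, using the directory from the answer:
# matching_hadoop_aws /usr/local/spark/jars
```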

Only hadoop-aws and aws-java-sdk are missing. A little digging in the Maven repository turned up hadoop-aws v2.7.3 and its dependency aws-java-sdk v1.7.4, and voila! I downloaded those jars and passed them as parameters to Spark, like this:

spark-shell \
--master yarn \
-v \
--jars file:/home/aws-java-sdk-1.7.4.jar,file:/home/hadoop-aws-2.7.3.jar \
--driver-class-path=/home/aws-java-sdk-1.7.4.jar:/home/hadoop-aws-2.7.3.jar
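
Note that the two options take differently formatted lists: --jars is comma-separated (here with file: URIs), while --driver-class-path uses the colon-separated form of a regular Java classpath. A small sketch that builds both from a single list of paths; the helper names jars_arg and classpath_arg are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for building the two argument formats.

# Comma-separated list of file: URIs, as passed to --jars.
jars_arg() (
  out=""
  for jar in "$@"; do
    out="${out:+$out,}file:$jar"   # add a comma only between items
  done
  printf '%s\n' "$out"
)

# Colon-separated list of plain paths, as passed to --driver-class-path.
classpath_arg() (
  IFS=':'                # "$*" joins on the first character of IFS
  printf '%s\n' "$*"
)

# Example, using the paths from the answer:
# spark-shell --master yarn -v \
#   --jars "$(jars_arg /home/aws-java-sdk-1.7.4.jar /home/hadoop-aws-2.7.3.jar)" \
#   --driver-class-path "$(classpath_arg /home/aws-java-sdk-1.7.4.jar /home/hadoop-aws-2.7.3.jar)"
```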

That did the job!!!

I'm just wondering why all the jars from Hadoop (I passed all of them as parameters to --jars and --driver-class-path) did not take effect. Spark somehow automatically chose its own jars rather than the ones I passed.
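
One likely explanation, offered as an assumption rather than a confirmed diagnosis: the extra jars were added alongside Spark's bundled hadoop-common-2.7.3.jar instead of replacing it, so hadoop-aws-2.9.2 ended up running against an older hadoop-common that lacks org.apache.hadoop.fs.StorageStatistics. A quick sanity check is to look for mixed Hadoop versions among the jar names that end up on the classpath. The sketch below does only that, under a hypothetical function name, and relies on the hadoop-<module>-<version>.jar naming seen above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the distinct versions found among the
# hadoop-*.jar file names given as arguments, one per line.
hadoop_versions() (
  for jar in "$@"; do
    name="${jar##*/}"                  # strip the directory
    case "$name" in
      hadoop-*.jar)
        name="${name%.jar}"            # strip the extension
        printf '%s\n' "${name##*-}"    # keep the text after the last dash
        ;;
    esac
  done | sort -u
)

# Example, using jar names from the question and the answer:
# hadoop_versions /usr/local/spark/jars/hadoop-common-2.7.3.jar \
#   /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar
```

More than one line of output, as with the jar names in the example, points to a mix of Hadoop releases on the same classpath.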
