Hadoop 2.9.2, Spark 2.4.0 access AWS s3a bucket


Problem description

It's been a couple of days now, and I still cannot download from a public Amazon bucket using Spark :(

Here is the spark-shell command:

spark-shell --master yarn \
            -v \
            --jars file:/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar,file:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar \
            --driver-class-path=/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar

The application starts and the shell waits at the prompt:

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val data1 = sc.textFile("s3a://my-bucket-name/README.md")

18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 242.1 KB, free 246.7 MB)
18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.2 KB, free 246.6 MB)
18/12/25 13:06:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop-edge01:3545 (size: 24.2 KB, free: 246.9 MB)
18/12/25 13:06:40 INFO SparkContext: Created broadcast 0 from textFile at <console>:24
data1: org.apache.spark.rdd.RDD[String] = s3a://my-bucket-name/README.md MapPartitionsRDD[1] at textFile at <console>:24

scala> data1.count()

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:97)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD.count(RDD.scala:1168)
... 49 elided
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.StorageStatistics
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 77 more

scala>

  1. All AWS access keys and secret keys were set in hadoop/core-site.xml as described here: Hadoop-AWS module: Integration with Amazon Web Services (see the in-shell sketch after the jar checks below)
  2. The bucket is public - anyone can download from it (tested with curl -O)
  3. All the .jars you can see above were provided by Hadoop itself, from the /usr/local/hadoop/share/hadoop/tools/lib/ folder
  4. There are no additional settings in spark-defaults.conf - only what was passed on the command line
  5. Neither jar provides this class:

jar tf /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar | grep org/apache/hadoop/fs/StorageStatistics
(no result)

jar tf /usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar | grep org/apache/hadoop/fs/StorageStatistics
(no result)
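For completeness, the credentials from point 1 can also be set from inside spark-shell instead of core-site.xml. This is only a minimal sketch (not from the original post) with placeholder values, assuming the standard fs.s3a.* property names documented for the Hadoop-AWS module:

// Minimal sketch: apply the S3A credentials on the Hadoop configuration
// used by this SparkContext (placeholder values, not real keys).
sc.hadoopConfiguration.set("fs.s3a.access.key", "<YOUR_ACCESS_KEY>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<YOUR_SECRET_KEY>")
// (On newer Hadoop versions a public bucket can also be read anonymously
// via the fs.s3a.aws.credentials.provider setting.)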

What should I do? Did I forget to add another jar? What are the exact versions and configuration of hadoop-aws and aws-java-sdk-bundle that should be used?

Recommended answer

Mmmm.... I finally found the problem..

The main issue is that the Spark I have is pre-built for Hadoop. It's 'v2.4.0 pre-built for Hadoop 2.7 and later', which is a bit of a misleading title, as you can see from my struggles above. In fact, Spark ships with its own, different version of the Hadoop jars. A listing of /usr/local/spark/jars/ shows that it has:

hadoop-common-2.7.3.jar
hadoop-client-2.7.3.jar
....


It is only missing hadoop-aws and aws-java-sdk. I dug around a little in the Maven repository: hadoop-aws v2.7.3 and its dependency aws-java-sdk v1.7.4, and voila! I downloaded those jars and passed them as parameters to Spark, like this:


spark-shell --master yarn \
            -v \
            --jars file:/home/aws-java-sdk-1.7.4.jar,file:/home/hadoop-aws-2.7.3.jar \
            --driver-class-path=/home/aws-java-sdk-1.7.4.jar:/home/hadoop-aws-2.7.3.jar

Mission accomplished!!!
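For reference, the original read from the question can then be retried in the new shell (same public bucket as above):

scala> val data1 = sc.textFile("s3a://my-bucket-name/README.md")
scala> data1.count()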

I'm just wondering why none of the jars from Hadoop (and I passed all of them via --jars and --driver-class-path) were picked up. Spark somehow automatically chose its own jars instead of the ones I sent.
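One way to see which copy wins (a quick sketch, not from the original post, run inside spark-shell) is to ask the JVM where a Hadoop class was actually loaded from; with the pre-built distribution it should point at Spark's bundled hadoop-common jar rather than the jars passed via --jars:

// Sketch: print the jar a Hadoop class was loaded from on the driver.
// getCodeSource can be null for bootstrap classes, hence the Option wrapper.
val src = Option(classOf[org.apache.hadoop.fs.FileSystem]
  .getProtectionDomain.getCodeSource).map(_.getLocation)
println(src)  // expected to point at .../spark/jars/hadoop-common-2.7.3.jar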

