NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities while reading S3 data with Spark

Problem description

I would like to run a simple Spark job on my local dev machine (through IntelliJ) that reads data from Amazon S3.

My build.sbt file:

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1",
  "org.apache.spark" %% "spark-sql" % "2.3.1",
  "com.amazonaws" % "aws-java-sdk" % "1.11.407",
  "org.apache.hadoop" % "hadoop-aws" % "3.1.1"
)

My code snippet:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("test")
  .master("local[2]")
  .getOrCreate()

// Registers the s3n filesystem, although the read below uses the s3a scheme
spark
  .sparkContext
  .hadoopConfiguration
  .set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

val schema_p = ...

val df = spark
  .read
  .schema(schema_p)
  .parquet("s3a:///...")

And I get the following exception:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2093)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2058)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2152)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)
    at Test$.delayedEndpoint$Test$1(Test.scala:27)
    at Test$delayedInit$body.apply(Test.scala:4)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at Test$.main(Test.scala:4)
    at Test.main(Test.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.StreamCapabilities
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 41 more

When I replace s3a:/// with s3:///, I get a different error: No FileSystem for scheme: s3

As I am new to AWS, I do not know whether I should use s3:///, s3a:/// or s3n:///. I have already set up my AWS credentials with aws-cli.

I do not have Spark installed on my machine.

Thanks in advance for your help

Solution

I would start by looking at the S3A troubleshooting docs, which say:

Do not attempt to "drop in" a newer version of the AWS SDK than that which the Hadoop version was built with. Whatever problem you have, changing the AWS SDK version will not fix things, only change the stack traces you see.

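Whatever version of the hadoop-* JARs you have on your local Spark installation, you need exactly the same version of hadoop-aws, and exactly the same version of the AWS SDK that hadoop-aws was built with. In the build above, hadoop-aws 3.1.1 is most likely being mixed with the older Hadoop JARs that the Spark 2.3.1 artifacts bring in, and those do not contain org.apache.hadoop.fs.StreamCapabilities, hence the NoClassDefFoundError. Try mvnrepository for the details.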
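As a rough illustration of this advice, a build.sbt along the following lines keeps the Hadoop JARs and hadoop-aws on one version. This is only a sketch: the hadoopVersion value is an assumption chosen for illustration, not something stated in the question, and it should be replaced with whichever Hadoop version your Spark artifacts actually resolve to. Pinning hadoop-client explicitly is likewise a choice made here, not part of the original advice.

```scala
scalaVersion := "2.11.12"

// Assumption: illustrative Hadoop version. Replace it with the version
// of the hadoop-* JARs that really ends up on your classpath.
val hadoopVersion = "2.7.3"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.1",
  "org.apache.spark" %% "spark-sql" % "2.3.1",
  // Pin the core Hadoop JARs and hadoop-aws to the same version.
  "org.apache.hadoop" % "hadoop-client" % hadoopVersion,
  "org.apache.hadoop" % "hadoop-aws" % hadoopVersion
  // No explicit aws-java-sdk entry: hadoop-aws pulls in, transitively,
  // the SDK version it was built against.
)
```

To see which Hadoop version is really on the classpath, printing org.apache.hadoop.util.VersionInfo.getVersion from the application is one option. The hadoop-aws page on mvnrepository for that version lists the AWS SDK it was compiled against.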
