Spark + s3 - error - java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
Problem description
I have a Spark EC2 cluster where I am submitting a pyspark program from a Zeppelin notebook. I have loaded hadoop-aws-2.7.3.jar and aws-java-sdk-1.11.179.jar and placed them in the /opt/spark/jars directory of the Spark instances. I get a java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException.
Why is Spark not seeing the jars? Do I have to put the jars on all the slaves and specify a spark-defaults.conf for the master and the slaves? Is there something that needs to be configured in Zeppelin to recognize the new jar files?
I have placed the jar files in /opt/spark/jars on the Spark master. I have created a spark-defaults.conf and added the lines:
spark.hadoop.fs.s3a.access.key [ACCESS KEY]
spark.hadoop.fs.s3a.secret.key [SECRET KEY]
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.driver.extraClassPath /opt/spark/jars/hadoop-aws-2.7.3.jar:/opt/spark/jars/aws-java-sdk-1.11.179.jar
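Note that spark.driver.extraClassPath only puts the jars on the driver JVM's classpath, while the trace below fails on an executor. A sketch of the matching executor-side line, assuming every node has the same /opt/spark/jars layout:

spark.executor.extraClassPath /opt/spark/jars/hadoop-aws-2.7.3.jar:/opt/spark/jars/aws-java-sdk-1.11.179.jar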
I have the Zeppelin interpreter send a spark-submit to the Spark master.
I have also placed the jars in /opt/spark/jars on the slaves, but did not create a spark-defaults.conf there.
%spark.pyspark
# importing necessary libraries
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import StringType
from pyspark import SQLContext
from itertools import islice
from pyspark.sql.functions import col
# add aws credentials
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "[ACCESS KEY]")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "[SECRET KEY]")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
#creating the context
sqlContext = SQLContext(sc)
#reading the first csv file and store it in an RDD
rdd1 = sc.textFile("s3a://filepath/baby-names.csv").map(lambda line: line.split(","))
#removing the first row as it contains the header
rdd1 = rdd1.mapPartitionsWithIndex(
lambda idx, it: islice(it, 1, None) if idx == 0 else it
)
#converting the RDD into a dataframe
df1 = rdd1.toDF(['year','name', 'percent', 'sex'])
#print the dataframe
df1.show()
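For comparison, once the classpath issue is fixed, the same file can be loaded without the manual header-stripping by using the DataFrame reader; a minimal sketch, assuming Spark 2.x and the same placeholder path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# header=True makes Spark drop the header row itself
df1 = spark.read.csv("s3a://filepath/baby-names.csv", header=True) \
    .toDF("year", "name", "percent", "sex")
df1.show()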
Error thrown:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, 10.11.93.90, executor 1): java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonServiceException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 34 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:153)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonServiceException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 34 more
Answer

I was able to address the above by making sure I had the correct version of the hadoop-aws jar for the Hadoop version of the Spark build I was running, downloading the correct version of aws-java-sdk, and lastly downloading the dependent jets3t library.
In /opt/spark/jars:
sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.11.30/aws-java-sdk-1.11.30.jar
sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar
sudo wget https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar
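As an alternative to fetching the jars by hand, spark-submit can resolve a matching set from Maven with --packages, which pulls the aws-java-sdk dependency in transitively; a sketch, where my_job.py is a placeholder and the hadoop-aws version must match your Hadoop build:

spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3 my_job.py

In Zeppelin, the equivalent should be setting spark.jars.packages in the Spark interpreter settings.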
Testing it out
scala> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", [ACCESS KEY ID])
scala> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", [SECRET ACCESS KEY] )
scala> val myRDD = sc.textFile("s3n://adp-px/baby-names.csv")
scala> myRDD.count()
res2: Long = 49
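Since the original failure was against an s3a:// path, it may be worth repeating the check through the s3a connector as well; a sketch using the same bucket and the Hadoop 2.7+ s3a credential properties:

scala> sc.hadoopConfiguration.set("fs.s3a.access.key", [ACCESS KEY ID])
scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", [SECRET ACCESS KEY])
scala> sc.textFile("s3a://adp-px/baby-names.csv").count()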