How to use S3 with Apache Spark 2.2 in the Spark shell
Question
I'm trying to load data from an Amazon AWS S3 bucket while in the Spark shell.
I have consulted the following resource:
How to access s3a:// files from Apache Spark?
I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults.conf I have the following (note that I replaced access-key and secret-key):
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key
spark.hadoop.fs.s3a.secret.key=secret-key
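As a side note, the same settings can be supplied per-invocation instead of editing conf/spark-defaults.conf — Spark copies any spark.hadoop.* property into the Hadoop configuration. A minimal sketch (the values here are placeholders, as above):

```shell
# Sketch: pass the S3A credentials on the command line rather than in
# conf/spark-defaults.conf; spark.hadoop.* properties are forwarded to Hadoop.
bin/spark-shell \
  --conf spark.hadoop.fs.s3a.access.key=access-key \
  --conf spark.hadoop.fs.s3a.secret.key=secret-key
```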
I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository and placed them in the jars/ directory. I then start the Spark shell:
bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar
In the shell, here is how I try to load data from the S3 bucket:
val p = spark.read.textFile("s3a://sparkcookbook/person")
Here is the resulting error:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
When I instead try to start the Spark shell as follows:
bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1
I then get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:
:: problems summary ::
:::: ERRORS
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
unknown resolver null
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Here is the second:
val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)
Could someone suggest how to get this working? Thanks.
Answer
If you are using Apache Spark 2.2.0, then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar. The stock Spark 2.2.0 distribution is built against Hadoop 2.7.x, so the hadoop-aws jar must match that version, and aws-java-sdk-1.7.4 is the SDK version that hadoop-aws-2.7.3 was built against. Mixing 2.8.1 jars with the bundled 2.7.x Hadoop classes is what produces the NoClassDefFoundError and IllegalAccessError above.
$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar
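The rule of thumb is to read the bundled Hadoop version off the hadoop-common jar that ships in Spark's jars/ directory and pick the hadoop-aws jar with the exact same version. A toy sketch of that check (the /tmp/demo-jars directory and its jar file are made up for illustration; in practice you would point this at your Spark distribution's jars/ directory):

```shell
# Illustrative only: simulate a Spark jars/ directory containing the
# bundled hadoop-common jar, then derive the matching hadoop-aws version.
mkdir -p /tmp/demo-jars
touch /tmp/demo-jars/hadoop-common-2.7.3.jar

# Extract the version number from the jar file name.
hadoop_ver=$(basename /tmp/demo-jars/hadoop-common-*.jar | sed 's/hadoop-common-\(.*\)\.jar/\1/')
echo "use hadoop-aws-${hadoop_ver}.jar"   # prints: use hadoop-aws-2.7.3.jar
```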
After that, you will be able to load data from the S3 bucket in the shell.