How to use s3 with Apache spark 2.2 in the Spark shell

Problem description

I'm trying to load data from an Amazon AWS S3 bucket, while in the Spark shell.

I have consulted the following resources:

Parsing files from Amazon S3 with Apache Spark

How to access s3a:// files from Apache Spark?

Hortonworks Spark 1.6 and S3

Cloudera

Custom s3 endpoints

I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults I have the following (note I replaced access-key and secret-key):

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key 
spark.hadoop.fs.s3a.secret.key=secret-key
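
The same s3a settings can also be applied at runtime from inside the shell on the Hadoop configuration instead of conf/spark-defaults (a minimal sketch, using the same placeholder values):

spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "access-key")  // placeholder value
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "secret-key")  // placeholder value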

I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository, and placed them in the jars/ directory. I then start the Spark shell:

bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar

In the shell, here is how I try to load data from the S3 bucket:

val p = spark.read.textFile("s3a://sparkcookbook/person")

Here is the resulting error:

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)

When I instead try to start the Spark shell as follows:

bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1

Then I get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:

:: problems summary ::
:::: ERRORS
    unknown resolver null

    unknown resolver null

    unknown resolver null

    unknown resolver null

    unknown resolver null

    unknown resolver null


:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

Here is the second:

val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)

Could someone suggest how to get this working? Thanks.

Answer

If you are using Apache Spark 2.2.0, then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar, since the pre-built Spark 2.2.0 distribution bundles Hadoop 2.7.x and the hadoop-aws connector must match that Hadoop version; the NoClassDefFoundError and IllegalAccessError above are symptoms of mixing Hadoop 2.8 classes with the bundled 2.7 ones.

$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar

After that, when you try to load data from the S3 bucket in the shell, you will be able to do so.
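
For reference, a minimal session with the matching jars might look like the sketch below (it reuses the bucket path from the question; the show and count calls are just quick checks that the read succeeded):

// Started with: bin/spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar
// Credentials are picked up from conf/spark-defaults (fs.s3a.access.key / fs.s3a.secret.key).
val p = spark.read.textFile("s3a://sparkcookbook/person")  // Dataset[String]
p.show(5)           // preview the first few lines
println(p.count())  // total number of lines read from S3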
