How to use s3 with Apache Spark 2.2 in the Spark shell

Problem Description

I'm trying to load data from an Amazon AWS S3 bucket, while in the Spark shell.

I have consulted the following resources:

Parsing files from Amazon S3 with Apache Spark

How to access s3a:// files from Apache Spark?

Hortonworks Spark 1.6 and S3

Cloudera

Custom s3 endpoints

I have downloaded and unzipped Apache Spark 2.2.0. In conf/spark-defaults.conf I have the following (note that I replaced the actual access-key and secret-key values):

spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=access-key 
spark.hadoop.fs.s3a.secret.key=secret-key
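
The same s3a settings can also be applied at runtime from inside the shell, instead of editing conf/spark-defaults.conf; a minimal sketch, where the placeholder credentials are assumptions to be replaced with real values:

// Set the s3a credentials on the Hadoop configuration of the running session
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "access-key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "secret-key")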

I have downloaded hadoop-aws-2.8.1.jar and aws-java-sdk-1.11.179.jar from mvnrepository, and placed them in the jars/ directory. I then start the Spark shell:

bin/spark-shell --jars jars/hadoop-aws-2.8.1.jar,jars/aws-java-sdk-1.11.179.jar
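
An optional sanity check (not part of the original steps) is to confirm from inside the shell that the s3a classes actually made it onto the classpath:

// Throws ClassNotFoundException if hadoop-aws is missing from the classpath
Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")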

In the shell, here is how I try to load data from the S3 bucket:

val p = spark.read.textFile("s3a://sparkcookbook/person")

Here is the resulting error:

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/GlobalStorageStatistics$StorageStatisticsProvider
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)

When I instead try to start the Spark shell as follows:

bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.8.1

Then I get two errors: one when the interpreter starts, and another when I try to load the data. Here is the first:

:: problems summary ::
:::: ERRORS
    unknown resolver null

    unknown resolver null

    unknown resolver null

    unknown resolver null

    unknown resolver null

    unknown resolver null


:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

Here is the second:

val p = spark.read.textFile("s3a://sparkcookbook/person")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:195)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
  at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:506)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:542)
  at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:515)

Could someone suggest how to get this working? Thanks.

Recommended Answer

If you are using Apache Spark 2.2.0, then you should use hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar. The Spark 2.2.0 binary distribution is built against Hadoop 2.7, so hadoop-aws must match that Hadoop version, and aws-java-sdk-1.7.4 is the SDK release that hadoop-aws 2.7.x was compiled against; mixing hadoop-aws-2.8.1 with the bundled Hadoop 2.7 classes is what produces the NoClassDefFoundError and IllegalAccessError shown above.

$ spark-shell --jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar

After that, you will be able to load data from the S3 bucket in the shell.
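
For example, a minimal check in the shell, using the same bucket path as in the question:

val p = spark.read.textFile("s3a://sparkcookbook/person")
p.count()  // forces a real S3 request; fails here if the jars or credentials are wrong
p.show(5)  // print the first few records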
