How to access s3a:// files from Apache Spark?


Problem description

Hadoop 2.6 doesn't support s3a out of the box, so I've tried a series of solutions and fixes, including:

- deploy with hadoop-aws and aws-java-sdk => cannot read the environment variables for credentials
- add hadoop-aws into Maven => various transitive dependency conflicts

Has anyone successfully made both work?

Recommended answer

Having experienced first hand the difference between s3a and s3n (7.9 GB of data transferred over s3a took around 7 minutes, while the same 7.9 GB over s3n took 73 minutes; us-east-1 to us-west-1 in both cases, unfortunately, with Redshift and Lambda being us-east-1 at this time), this is a very important piece of the stack to get right, and it's worth the frustration.

Here are the key parts, as of December 2015:

1. Your Spark cluster will need Hadoop 2.x or greater. If you use the Spark EC2 setup scripts and perhaps missed it, the switch for using something other than Hadoop 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).

2. You'll need to include what may at first seem to be an out-of-date AWS SDK library (built in 2014 as version 1.7.4) for Hadoop versions as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell, using this alongside the specific AWS SDK JARs for 1.10.8 hasn't broken anything.

3. You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem (a short usage sketch follows this list).
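To make points 2 and 3 concrete, here is a minimal standalone Scala sketch of reading from an s3a:// path once hadoop-aws 2.7.1 and aws-java-sdk 1.7.4 are on the classpath. The bucket and path are hypothetical placeholders, and explicitly setting fs.s3a.impl is normally redundant on Hadoop 2.7 (core-default.xml already maps the scheme), but it makes the dependency on S3AFileSystem visible:

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch; bucket name and key prefix are placeholders.
    val conf = new SparkConf()
      .setAppName("s3a-read-sketch")
      // Usually already the default mapping on Hadoop 2.7, shown here to make
      // explicit which class the hadoop-aws JAR must provide.
      .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

    val sc = new SparkContext(conf)

    // Credentials come from spark.properties (see below) or can be set
    // programmatically on sc.hadoopConfiguration.
    val lines = sc.textFile("s3a://some-bucket/some/prefix/part-*")
    println(lines.count())

If the hadoop-aws JAR is missing, this usually fails with a ClassNotFoundException for S3AFileSystem rather than a credentials error.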

In spark.properties you probably want some settings that look like this:

spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY
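The spark.hadoop.* prefix tells Spark to copy these keys into the Hadoop configuration that the s3a filesystem reads. Roughly the same effect can be achieved programmatically; a sketch, assuming the credentials are available in the standard AWS environment variables and reusing the sc from the sketch above:

    // Sketch: copy credentials from the standard AWS environment variables
    // into the Hadoop configuration used by s3a. Adapt the variable names
    // if your credentials live somewhere else.
    sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))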

If you are using Hadoop 2.7 with Spark, the AWS client uses V2 as its default auth signature, and all of the newer AWS regions support only the V4 protocol. To use V4, pass these configs to spark-submit; the endpoint (format: s3.<region>.amazonaws.com) must also be specified.

--conf "spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true

--conf "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true

I've detailed this list in more depth in a post I wrote as I worked my way through this process. In addition, I've covered all the exception cases I hit along the way, what I believe to be the cause of each, and how to fix each one.
