How to access s3a:// files from Apache Spark?


Problem description



Hadoop 2.6 doesn't support s3a out of the box, so I've tried a series of solutions and fixes, including:

deploy with hadoop-aws and aws-java-sdk => cannot read environment variables for credentials
add hadoop-aws into maven => various transitive dependency conflicts

Has anyone successfully made both work?
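
For context, the first attempt above typically amounts to something like the sketch below. This is only an illustration: the JAR versions, paths, class name and bucket are placeholders, and the exported credentials are exactly what s3a ends up not reading on Hadoop 2.6.

    # Hypothetical sketch of the "deploy with hadoop-aws and aws-java-sdk" attempt;
    # versions, paths and bucket are placeholders.
    export AWS_ACCESS_KEY_ID=...        # not picked up by s3a here, hence the credential error
    export AWS_SECRET_ACCESS_KEY=...

    spark-submit \
      --jars /path/to/hadoop-aws-2.6.0.jar,/path/to/aws-java-sdk-1.7.4.jar \
      --class com.example.MyJob \
      my-job.jar s3a://some-bucket/some/input/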

Solution

Having experienced first-hand the difference between s3a and s3n - 7.9GB of data transferred on s3a took around 7 minutes, while 7.9GB of data on s3n took 73 minutes [us-east-1 to us-west-1, unfortunately, in both cases; Redshift and Lambda being in us-east-1 at this time] - this is a very important piece of the stack to get correct, and it's worth the frustration.

Here are the key parts, as of December 2015:

  1. Your Spark cluster will need a Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).

  2. You'll need to include what may at first seem to be an out-of-date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell, using this alongside the specific AWS SDK JARs for 1.10.8 hasn't broken anything.

  3. You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.

  4. In spark.properties you probably want some settings that look like this:

    spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem  
    spark.hadoop.fs.s3a.access.key=ACCESSKEY  
    spark.hadoop.fs.s3a.secret.key=SECRETKEY
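
If it's easier to set these at runtime than in spark.properties, the same keys can also be applied to the Hadoop configuration from application code. A minimal Scala sketch, assuming the aws-java-sdk and hadoop-aws JARs above are already on the classpath (bucket and credentials are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("s3a-example"))

    // Same settings as the spark.properties block above, applied programmatically.
    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoopConf.set("fs.s3a.access.key", "ACCESSKEY")
    hadoopConf.set("fs.s3a.secret.key", "SECRETKEY")

    // Read something back to confirm the s3a filesystem is wired up correctly.
    val lines = sc.textFile("s3a://some-bucket/some/prefix/*.txt")
    println(lines.count())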
    

I've covered this list in more detail in a post I wrote as I worked my way through this process. In addition, I've covered all the exception cases I hit along the way, what I believe to be the cause of each, and how to fix them.
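
Putting the list together end to end, a launch-and-submit sequence might look roughly like the following sketch. The cluster name, key pair, slave count, JAR paths, class name and bucket are placeholders, and the exact spark-ec2 flags vary between Spark releases, so treat this as an outline rather than exact commands:

    # Item 1: launch a cluster on Hadoop 2.x rather than 1.0.
    ./spark-ec2 --key-pair=my-keypair --identity-file=my-keypair.pem \
      --slaves=3 --hadoop-major-version=2 launch my-s3a-cluster

    # Items 2-4: put aws-java-sdk 1.7.4 and hadoop-aws 2.7.1 on the classpath
    # and pass the spark.properties file with the fs.s3a.* settings shown above.
    spark-submit \
      --properties-file spark.properties \
      --jars /path/to/aws-java-sdk-1.7.4.jar,/path/to/hadoop-aws-2.7.1.jar \
      --class com.example.MyJob \
      my-job.jar s3a://some-bucket/some/input/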
