How do you use s3a with spark 2.1.0 on aws us-east-2?

Problem description

Background

I have been working on getting a flexible setup for myself to use Spark on AWS with Docker swarm mode. The Docker image I have been using is configured to use the latest Spark, which at the time is 2.1.0 with Hadoop 2.7.3, and is available at jupyter/pyspark-notebook.

This is working, and I have been going through to test the various connectivity paths that I plan to use. The issue I came across is uncertainty around the correct way to interact with S3. I have followed the trail on how to provide the dependencies for Spark to connect to data on AWS S3 using the s3a protocol, as opposed to the s3n protocol.

I finally came across the Hadoop-AWS guide and thought I was following how to provide the configuration. However, I was still receiving the 400 Bad Request error, as seen in this question that describes how to fix it by defining the endpoint, which I had already done.

Being on us-east-2 put me far enough off the standard configuration that I was uncertain whether the problem was with the jar files. To eliminate the region issue, I set things back up in the regular us-east-1 region, and I was finally able to connect with s3a. So I have narrowed the problem down to the region, but I thought I was doing everything required to operate in the other region.

Question

What is the correct way to use the Hadoop configuration variables in Spark in order to use us-east-2?

Note: This example uses local execution mode to simplify things.

import os
import pyspark

I can see these packages download in the notebook console after creating the context, and adding them took me from being completely broken to getting the Bad Request error.

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'

conf = pyspark.SparkConf().setMaster('local[1]')  # SparkConf() takes no positional master URL; set it explicitly
sc = pyspark.SparkContext(conf=conf)
sql = pyspark.SQLContext(sc)

For the AWS config, I tried both the method below and, alternatively, just using the conf above with the equivalent conf.set("spark.hadoop.fs.<config_string>", <config_value>) pattern, the difference being that in that case the values are set on conf before the Spark context is created (a sketch of that variant follows the next code block).

hadoop_conf = sc._jsc.hadoopConfiguration()

hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)

One thing to note is that I also tried the alternative endpoint for us-east-2, s3-us-east-2.amazonaws.com.

I then read some Parquet data from S3.

df = sql.read.parquet('s3a://bucket-name/parquet-data-name')
df.limit(10).toPandas()

Again, after moving the EC2 instance to us-east-1 and commenting out the endpoint config, the above works for me. It seems like the endpoint config isn't being used for some reason.

Answer

us-east-2 is a V4-auth S3 instance, so, as you attempted, the fs.s3a.endpoint value must be set.

If it's not being picked up, then assume the config you are setting isn't the one being used to access the bucket. Know that Hadoop caches filesystem instances by URI, even when the config changes. The first attempt to access a filesystem fixes the config, even when it's lacking auth details.

Some tactics

  1. Set the value in spark-defaults.
  2. Using the config you've just created, try to explicitly load the filesystem via a call to FileSystem.get(new URI("s3a://bucket-name/parquet-data-name"), myConf); this will return the bucket with that config (unless it's already there). I don't know how to make that call in .py though (one possible way to try it from PySpark is sketched after this list).
  3. Set the property "fs.s3a.impl.disable.cache" to true to bypass the cache before the get command.
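
For what it's worth, here is a rough, unverified sketch of what tactics 2 and 3 might look like from PySpark, going through the py4j gateway (sc._jvm and sc._jsc are internal, non-public handles, and this path is an assumption rather than a confirmed recipe):

# Tactic 3: disable the S3A filesystem cache so the next get() builds a fresh instance
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl.disable.cache", "true")

# Tactic 2: explicitly load the filesystem with this configuration via the JVM gateway
uri = sc._jvm.java.net.URI("s3a://bucket-name/parquet-data-name")
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(uri, hadoop_conf)

Tactic 1 would amount to putting the same properties, prefixed with spark.hadoop., into spark-defaults.conf.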

Adding more diagnostics on BadAuth errors, along with a wiki page, is a feature listed for S3A phase III. If you were to add it, along with a test, I can review it and get it in.
