How do you use s3a with spark 2.1.0 on aws us-east-2?


Problem description

Background

I have been working on a flexible setup for myself to use Spark on AWS with Docker swarm mode. The Docker image I have been using is configured to use the latest Spark, which at the time was 2.1.0 with Hadoop 2.7.3, and is available at jupyter/pyspark-notebook.

This is working, and I have been testing the various connectivity paths that I plan to use. The issue I came across is uncertainty around the correct way to interact with S3. I followed the trail on how to provide the dependencies for Spark to connect to data on AWS S3 using the s3a protocol rather than the s3n protocol.

I finally came across the Hadoop AWS guide and thought I was following how to provide the configuration. However, I was still receiving the 400 Bad Request error, as in this question, which describes how to fix it by defining the endpoint, which I had already done.

I ended up being too far off the standard configuration by being on us-east-2, which made me uncertain whether I had a problem with the jar files. To eliminate the region issue, I set things back up in the regular us-east-1 region and was finally able to connect with s3a. So I have narrowed the problem down to the region, but I thought I was doing everything required to operate in the other region.

Question

What is the correct way to use Hadoop's configuration variables in Spark so that it works with us-east-2?

Note: This example uses local execution mode to simplify things.

import os
import pyspark

I can see these packages being downloaded in the notebook's console after creating the context, and adding them took me from being completely broken to getting the Bad Request error.

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'

# SparkConf's first positional argument is loadDefaults, not the master URL,
# so the master has to be set explicitly.
conf = pyspark.SparkConf().setMaster('local[1]')
sc = pyspark.SparkContext(conf=conf)
sql = pyspark.SQLContext(sc)

For the AWS config, I tried both the method below and just using the conf above with the equivalent conf.set("spark.hadoop.fs.<config_string>", "<config_value>") pattern, except that when doing it that way I set the values on conf before creating the Spark context.

hadoop_conf = sc._jsc.hadoopConfiguration()

hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)

One thing to note is that I also tried an alternative endpoint for us-east-2, s3-us-east-2.amazonaws.com.

I then read some parquet data off of S3.

df = sql.read.parquet('s3a://bucket-name/parquet-data-name')
df.limit(10).toPandas()

Again, after moving the EC2 instance to us-east-1 and commenting out the endpoint config, the above works for me. To me, it seems like the endpoint config isn't being used for some reason.

Answer

us-east-2 is a V4-auth S3 instance, so, as you attempted, the fs.s3a.endpoint value must be set.

If it's not being picked up, then assume the config you are setting isn't the one being used to access the bucket. Know that Hadoop caches filesystem instances by URI, even when the config changes. The first attempt to access a filesystem fixes the config, even when it lacks auth details.

Some tips

  1. Set the value in spark-defaults.
  2. Using the config you've just created, try to explicitly load the filesystem via a call to FileSystem.get(new URI("s3a://bucket-name/parquet-data-name"), myConf), which will return the filesystem for that bucket with that config (unless one is already cached). I don't know how to make that call in .py though (a possible py4j approach is sketched after this list).
  3. Set the property "fs.s3a.impl.disable.cache" to true to bypass the cache before the get command.
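
One possible way to make that FileSystem.get call from Python is through the py4j gateway that PySpark exposes; a rough, untested sketch, reusing sc and hadoop_conf from the question above:

# Tip 3: bypass the FileSystem cache so the next get() builds a fresh
# instance with the configuration set above.
hadoop_conf.set("fs.s3a.impl.disable.cache", "true")

# Tip 2: call the static FileSystem.get(URI, Configuration) through py4j.
jvm = sc._jvm
uri = jvm.java.net.URI("s3a://bucket-name/parquet-data-name")
fs = jvm.org.apache.hadoop.fs.FileSystem.get(uri, hadoop_conf)
# fs should now be an S3AFileSystem bound to the endpoint and credentials set earlier.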

Adding more diagnostics on BadAuth errors, along with a wiki page, is a feature listed for S3A phase III. If you were to add it, along with a test, I can review it and get it in.
