How do you use s3a with spark 2.1.0 on aws us-east-2?


Problem description

Background

I have been working on a flexible setup for myself to use Spark on AWS with Docker swarm mode. The Docker image I have been using is configured to use the latest Spark, which at the time was 2.1.0 with Hadoop 2.7.3, and is available at jupyter/pyspark-notebook.

This is working, and I have been testing the various connectivity paths that I plan to use. The issue I came across is uncertainty around the correct way to interact with S3. I followed the trail on how to provide the dependencies for Spark to connect to data on AWS S3 using the s3a protocol rather than the s3n protocol.

I finally came across the Hadoop AWS guide and thought I was following how to provide the configuration. However, I was still receiving the 400 Bad Request error, as in this question, which describes how to fix it by defining the endpoint, which I had already done.

I ended up being too far off the standard configuration by being on us-east-2, which made me uncertain whether I had a problem with the jar files. To eliminate the region issue, I set things back up in the regular us-east-1 region and was finally able to connect with s3a. So I have narrowed the problem down to the region, but I thought I was doing everything required to operate in the other region.

Question

What is the correct way to use Hadoop's configuration variables in Spark so that it works with us-east-2?

Note: This example uses local execution mode to simplify things.

import os
import pyspark

I can see these packages being downloaded in the notebook's console after creating the context, and adding them took me from being completely broken to getting the Bad Request error.

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'

# SparkConf's first positional argument is loadDefaults, not the master URL,
# so the master has to be set explicitly.
conf = pyspark.SparkConf().setMaster('local[1]')
sc = pyspark.SparkContext(conf=conf)
sql = pyspark.SQLContext(sc)

For the AWS config, I tried both the method below and just using the conf above with the equivalent conf.set("spark.hadoop.fs.<config_string>", "<config_value>") pattern, except that when doing it that way I set the values on conf before creating the Spark context.

hadoop_conf = sc._jsc.hadoopConfiguration()

hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)

One thing to note is that I also tried an alternative endpoint for us-east-2, s3-us-east-2.amazonaws.com.

I then read some parquet data off of S3.

df = sql.read.parquet('s3a://bucket-name/parquet-data-name')
df.limit(10).toPandas()

Again, after moving the EC2 instance to us-east-1 and commenting out the endpoint config, the above works for me. To me, it seems like the endpoint config isn't being used for some reason.

Answer

us-east-2 is a V4-auth S3 instance, so, as you attempted, the fs.s3a.endpoint value must be set.

If it's not being picked up, then assume the config you are setting isn't the one being used to access the bucket. Know that Hadoop caches filesystem instances by URI, even when the config changes. The first attempt to access a filesystem fixes the config, even when it lacks auth details.

Some tips

  1. Set the value in spark-defaults.
  2. Using the config you've just created, try to explicitly load the filesystem via a call to FileSystem.get(new URI("s3a://bucket-name/parquet-data-name"), myConf), which will return the filesystem for that bucket with that config (unless one is already cached). I don't know how to make that call in .py though (a possible py4j approach is sketched after this list).
  3. Set the property "fs.s3a.impl.disable.cache" to true to bypass the cache before the get command.
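
One possible way to make that FileSystem.get call from Python is through the py4j gateway that PySpark exposes; a rough, untested sketch, reusing sc and hadoop_conf from the question above:

# Tip 3: bypass the FileSystem cache so the next get() builds a fresh
# instance with the configuration set above.
hadoop_conf.set("fs.s3a.impl.disable.cache", "true")

# Tip 2: call the static FileSystem.get(URI, Configuration) through py4j.
jvm = sc._jvm
uri = jvm.java.net.URI("s3a://bucket-name/parquet-data-name")
fs = jvm.org.apache.hadoop.fs.FileSystem.get(uri, hadoop_conf)
# fs should now be an S3AFileSystem bound to the endpoint and credentials set earlier.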

Adding more diagnostics on BadAuth errors, along with a wiki page, is a feature listed for S3A phase III. If you were to add it, along with a test, I can review it and get it in.
