How do you use s3a with spark 2.1.0 on aws us-east-2?

Problem description

Background

I have been working on getting a flexible setup for myself to use Spark on AWS with Docker swarm mode. The Docker image I have been using is configured to use the latest Spark, which at the time is 2.1.0 with Hadoop 2.7.3, and is available at jupyter/pyspark-notebook.

This is working, and I have been going through to test the various connectivity paths that I plan to use. The issue I came across is uncertainty around the correct way to interact with S3. I have followed the trail on how to provide the dependencies for Spark to connect to data on AWS S3 using the s3a protocol, as opposed to the s3n protocol.

I finally came across the Hadoop-AWS guide and thought I was following how to provide the configuration. However, I was still receiving the 400 Bad Request error, as seen in this question that describes how to fix it by defining the endpoint, which I had already done.

Being on us-east-2 put me far enough off the standard configuration that I was uncertain whether the problem was with the jar files. To eliminate the region issue, I set things back up in the regular us-east-1 region, and I was finally able to connect with s3a. So I have narrowed the problem down to the region, but I thought I was doing everything required to operate in the other region.

Question

What is the correct way to use the Hadoop configuration variables in Spark in order to use us-east-2?

Note: This example uses local execution mode to simplify things.

import os
import pyspark

I can see these packages download in the notebook console after creating the context, and adding them took me from being completely broken to getting the Bad Request error.

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'

conf = pyspark.SparkConf().setMaster('local[1]')  # SparkConf() takes no positional master URL; set it explicitly
sc = pyspark.SparkContext(conf=conf)
sql = pyspark.SQLContext(sc)

For the AWS config, I tried both the method below and, alternatively, just using the conf above with the equivalent conf.set("spark.hadoop.fs.<config_string>", <config_value>) pattern, the difference being that in that case the values are set on conf before the Spark context is created (a sketch of that variant follows the next code block).

hadoop_conf = sc._jsc.hadoopConfiguration()

hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)

One thing to note is that I also tried the alternative endpoint for us-east-2, s3-us-east-2.amazonaws.com.

I then read some Parquet data from S3.

df = sql.read.parquet('s3a://bucket-name/parquet-data-name')
df.limit(10).toPandas()

Again, after moving the EC2 instance to us-east-1 and commenting out the endpoint config, the above works for me. It seems like the endpoint config isn't being used for some reason.

Answer

us-east-2 is a V4-auth S3 instance, so, as you attempted, the fs.s3a.endpoint value must be set.

If it's not being picked up, then assume the config you are setting isn't the one being used to access the bucket. Know that Hadoop caches filesystem instances by URI, even when the config changes. The first attempt to access a filesystem fixes the config, even when it's lacking auth details.

Some tactics

  1. Set the value in spark-defaults.
  2. Using the config you've just created, try to explicitly load the filesystem via a call to FileSystem.get(new URI("s3a://bucket-name/parquet-data-name"), myConf); this will return the bucket with that config (unless it's already there). I don't know how to make that call in .py though (one possible way to try it from PySpark is sketched after this list).
  3. Set the property "fs.s3a.impl.disable.cache" to true to bypass the cache before the get command.
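
For what it's worth, here is a rough, unverified sketch of what tactics 2 and 3 might look like from PySpark, going through the py4j gateway (sc._jvm and sc._jsc are internal, non-public handles, and this path is an assumption rather than a confirmed recipe):

# Tactic 3: disable the S3A filesystem cache so the next get() builds a fresh instance
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl.disable.cache", "true")

# Tactic 2: explicitly load the filesystem with this configuration via the JVM gateway
uri = sc._jvm.java.net.URI("s3a://bucket-name/parquet-data-name")
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(uri, hadoop_conf)

Tactic 1 would amount to putting the same properties, prefixed with spark.hadoop., into spark-defaults.conf.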

Adding more diagnostics on BadAuth errors, along with a wiki page, is a feature listed for S3A phase III. If you were to add it, along with a test, I can review it and get it in.
