S3A: fails while S3: works in Spark EMR


Question

I'm using EMR 5.5.0 with Spark. If I write a simple file to S3 using an s3://... URL, it writes fine. But if I use an s3a://... address, it fails with Service: Amazon S3; Status Code: 403; Error Code: AccessDenied

Using the AWS command line, I'm able to cp, mv, and rm any file in the path I'm writing to. But from Spark, s3a fails on the PUT request.

We have server-side encryption enabled, and I know Spark knows because the s3 URLs work. Any ideas?

Failed PUT DEBUG logs here. Maybe it's important to note: I'm doing an rdd.saveAsTextFile(path), but the PUT request says it's trying to write to /my-bucket/tmp/carlos/testWrite/4/_temporary/0/, which I thought it should only do for Parquet? Not sure if that detail is relevant, but I thought I would mention it.

Recommended answer

s3a is the actively maintained S3 client in Apache Hadoop. AWS forked their own client off from the Apache s3n:// client many years ago and have (presumably) massively reworked theirs.

They can read and write the same data, but some bits of EMR expect extra methods in the filesystem client that only the EMR s3 client supports, so you cannot safely use s3a on EMR.
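In practice that means sticking to s3:// URLs inside EMR jobs. A minimal sketch of one way to enforce that, assuming paths arrive as plain strings: the helper name to_emr_s3_url is hypothetical (it is not part of Spark, Hadoop, or EMR) and it only rewrites the URL scheme, nothing else.

```python
# Hypothetical helper (not a Spark/EMR API): rewrite Apache Hadoop S3 schemes
# to the s3:// scheme that the EMR filesystem client handles.
APACHE_S3_SCHEMES = {"s3a", "s3n"}


def to_emr_s3_url(url: str) -> str:
    """Return url with an s3a:// or s3n:// scheme replaced by s3://.

    URLs with any other scheme (or no scheme) are returned unchanged.
    """
    scheme, sep, rest = url.partition("://")
    if sep and scheme.lower() in APACHE_S3_SCHEMES:
        return "s3://" + rest
    return url


# Intended use inside a PySpark job on EMR (assumes an existing RDD named rdd):
#   rdd.saveAsTextFile(to_emr_s3_url(path))
```

This keeps the workaround in one place, so the same job code can be pointed back at s3a:// when run on a non-EMR Hadoop distribution by skipping the helper.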

There's also the original ASF s3:// client, which is incompatible with everything else but was the first code used to connect Hadoop with S3, well before EMR was an Amazon product.

Which is better? As of Aug 2017, S3A is probably faster on aggressive read IO of columnar formats like ORC and Parquet. EMR S3, with EMRFS, probably has the edge in terms of resilience and consistency. But the open source ASF S3A client is moving to address those.
