S3AFileSystem - FileAlreadyExistsException when prefix is a file and part of a directory tree

Problem Description

We are running Apache Spark jobs with aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.5.jar to write Parquet files to an S3 bucket.

We have the key 's3://mybucket/d1/d2/d3/d4/d5/d6/d7' in S3 (d7 being a text file). We also have the key 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180615/a.parquet' (a.parquet being a file).

When we run a Spark job to write a b.parquet file under 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/' (i.e. we would like 's3://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/b.parquet' to be created in S3), we get the error below:

org.apache.hadoop.fs.FileAlreadyExistsException: Can't make directory for path 's3a://mybucket/d1/d2/d3/d4/d5/d6/d7' since it is a file.
at org.apache.hadoop.fs.s3a.S3AFileSystem.mkdirs(S3AFileSystem.java:861)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1881)
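
For reference, a minimal sketch of the kind of write that triggers this, assuming a local SparkSession and a tiny throwaway DataFrame (both illustrative; only the bucket paths come from the question):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WriteRepro {
  public static void main(String[] args) {
    // local[*] is just for a standalone test run; any deployment mode
    // hits the same mkdirs() call on the S3A filesystem.
    SparkSession spark = SparkSession.builder()
        .appName("s3a-repro")
        .master("local[*]")
        .getOrCreate();

    // Any small DataFrame will do; the failure happens before data lands.
    Dataset<Row> df = spark.range(10).toDF();

    df.write().parquet(
        "s3a://mybucket/d1/d2/d3/d4/d5/d6/d7/d8/d9/part_dt=20180616/");
  }
}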

Recommended Answer

As discussed in HADOOP-15542, a "normal" FS won't let you put entries underneath a path that is itself a file, and you don't get them with the S3A connector either, at least where it does enough due diligence.
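
That due diligence is visible in the stack trace above: mkdirs() walks up the directory tree before creating anything. A condensed paraphrase of that ancestor walk, written against the generic FileSystem API for illustration (not the verbatim Hadoop 2.7.x source):

import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileAlreadyExistsException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class MkdirsCheck {
  // Walk from the requested directory up to the root; if any ancestor
  // resolves to a plain file, refuse to create the directory.
  static void checkAncestors(FileSystem fs, Path dir) throws IOException {
    Path part = dir;
    do {
      try {
        FileStatus status = fs.getFileStatus(part);
        if (status.isFile()) {
          // In the question, '.../d7' resolves here as a file.
          throw new FileAlreadyExistsException(String.format(
              "Can't make directory for path '%s' since it is a file.", part));
        }
      } catch (FileNotFoundException e) {
        // Nothing at this level; keep walking towards the root.
      }
      part = part.getParent();
    } while (part != null);
  }
}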

Such a layout confuses every single tree-walking algorithm: renames, deletes, anything which scans for files. That includes Spark's partitioning logic. The new directory tree you are trying to create would probably be invisible to callers. (You could test this by creating the tree, PUTting that text file into place, and seeing what happens.)
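
A hedged way to run that experiment through the standard FileSystem API (the bucket and paths are from the question; the class itself is illustrative):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TreeWalkProbe {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new URI("s3a://mybucket/"), new Configuration());
    Path d7 = new Path("/d1/d2/d3/d4/d5/d6/d7");

    // The exact-key lookup wins: d7 reports as a file, so recursive walkers
    // (rename, delete, Spark partition discovery) stop here and never
    // descend into d7/d8/d9/...
    System.out.println("d7 isFile = " + fs.getFileStatus(d7).isFile());
    for (FileStatus child : fs.listStatus(d7)) {
      System.out.println("listStatus sees: " + child.getPath());
    }
  }
}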

We try to define what an FS should do in the Hadoop Filesystem Specification, including defining things "so obvious" that nobody bothered to write them down or write tests for, such as (the first invariant is sketched as a test after this list):

  • Only directories can have children.
  • All children must have a parent.
  • Only files can contain data (exception: ReiserFS).
  • Files are as long as they say they are (which, BTW, is why S3A doesn't support client-side encryption).
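
The first invariant is exactly the one this question trips over. A contract-test-style sketch of it (the class and helper names here are illustrative, not the actual Hadoop contract-test code):

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileAlreadyExistsException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class OnlyDirectoriesHaveChildren {
  static void check(FileSystem fs) throws Exception {
    Path file = new Path("/contract-test/d7");
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("plain text");                // d7 is now a file
    }
    try {
      fs.mkdirs(new Path("/contract-test/d7/d8")); // a child under a file
      throw new AssertionError("expected FileAlreadyExistsException");
    } catch (FileAlreadyExistsException expected) {
      // Invariant holds: a file cannot have children.
    }
  }
}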

Every so often we discover some new thing we forgot to consider, which "real" filesystems enforce out of the box but which object stores don't. Then we add tests and try our best to maintain the metaphor, except when the performance impact would make it unusable. Then we opt not to fix things and hope nobody notices. Generally, because people working with data in the Hadoop/Hive/Spark space have those same preconceptions of what a filesystem does, these ambiguities don't actually cause problems in production.

Except, of course, for eventual consistency, which is why you shouldn't be writing data straight to S3 from Spark without a consistency service (S3Guard, consistent EMRFS) or a commit protocol designed for this world (the S3A committers, Databricks DBIO).
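
For completeness, a hedged sketch of opting into an S3A committer from Spark. These settings come from the Hadoop 3.1+ S3A committers and Spark's spark-hadoop-cloud module, so they are not available on the hadoop-aws 2.7.5 in the question; treat the keys as assumptions to verify against your versions:

import org.apache.spark.sql.SparkSession;

public class CommitterSession {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("s3a-committer")
        // Pick one of the S3A committers: "directory", "partitioned", "magic".
        .config("spark.hadoop.fs.s3a.committer.name", "directory")
        // Route Spark's commit protocol through the cloud committer bindings.
        .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
        .getOrCreate();
  }
}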
