How to configure Spark / Glue to avoid creation of empty $_folder_$ after Glue job successful execution

Views: 130 | Published: 2021/4/3 19:45:14 | Tags: amazon-web-services aws-glue aws-glue-spark aws-glue-workflow

Question

I have a simple Glue ETL job that is triggered by a Glue workflow. It drops duplicate rows from a crawler table and writes the result back to an S3 bucket. The job completes successfully. However, the empty "$folder$" objects that Spark generates remain in S3. They clutter the hierarchy and cause confusion. Is there any way to configure Spark or the Glue context to hide or remove these folders after the job completes successfully?

[S3 image]

Recommended answer

OK, after a few days of testing I finally found the solution. Before pasting the code, let me summarize what I found.

Those $folder$ objects are created by Hadoop. Apache Hadoop creates these files when it creates a folder in an S3 bucket (Source 1). They are actually directory markers, in the form path + / (Source 2).
To change the behavior, you need to change the Hadoop S3 write configuration in the Spark context. Read this and this and this.
Read about S3, S3A and S3N here and here.
Thanks to @stevel's comment here.

The solution is to set the following option on the Spark context's Hadoop configuration.

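from pyspark.context import SparkContext

sc = SparkContext()
# Handle s3:// URIs with the S3A file system implementation
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")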

To avoid creating _SUCCESS files, you also need to set the following option: hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
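The two settings above can be gathered into one small helper. This is only a sketch under assumptions: the name `apply_s3_output_settings` is hypothetical (it is not part of Spark or Glue), and `hadoop_conf` is taken to be any object with a `set(key, value)` method, such as the one returned by `sc._jsc.hadoopConfiguration()`. Whether the S3A classes are on the classpath of a given Glue version is not checked here.

```python
# Hypothetical helper (not a Spark or Glue API): applies the two Hadoop
# settings from the answer to any object exposing set(key, value).
S3_OUTPUT_SETTINGS = {
    # Map the s3:// scheme to the S3A file system implementation.
    "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    # Ask the output committer not to write a _SUCCESS marker file.
    "mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
}


def apply_s3_output_settings(hadoop_conf):
    """Set each key/value pair on hadoop_conf and return a copy of what was set."""
    for key, value in S3_OUTPUT_SETTINGS.items():
        hadoop_conf.set(key, value)
    return dict(S3_OUTPUT_SETTINGS)
```

In a job script this would presumably be called once, before any write, as `apply_s3_output_settings(sc._jsc.hadoopConfiguration())`.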

Make sure you use the S3 URI (s3://) when writing to the S3 bucket, for example:

myDF.write.mode("overwrite").parquet('s3://XXX/YY', partitionBy=['DDD'])
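The URI scheme matters because the `fs.s3.impl` setting applies to paths whose scheme is `s3`. As a rough illustration, a guard like the one below could flag an output path written with a different scheme before the job starts. The function name `uses_s3_scheme` is hypothetical, and the check is purely on the string; it does not contact S3.

```python
from urllib.parse import urlparse


def uses_s3_scheme(path: str) -> bool:
    """Return True when the path's URI scheme is exactly 's3'."""
    return urlparse(path).scheme == "s3"


# The placeholder path from the answer passes; other schemes do not.
output_path = "s3://XXX/YY"
if not uses_s3_scheme(output_path):
    raise ValueError(f"Expected an s3:// URI, got: {output_path}")
```

Raising early keeps a mistyped path from silently bypassing the configured file system implementation.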
