Junk Spark output file on S3 with dollar signs

Problem Description

I have a simple Spark job that reads a file from S3, takes five records, and writes them back to S3. What I see is that there is always an additional file in S3, next to my output "directory", called output_$folder$.

What is it? How can I prevent Spark from creating it? Here is some code to show what I am doing...

x = spark.sparkContext.textFile("s3n://.../0000_part_00")        # read the input file from S3
five = x.take(5)                                                  # collect the first five records to the driver
five = spark.sparkContext.parallelize(five)                       # turn them back into an RDD
five.repartition(1).saveAsTextFile("s3n://prod.casumo.stu/dimensions/output/")  # write a single part file back to S3

After the job I have an S3 "directory" called output which contains the results, and another S3 object called output_$folder$ which I don't know what it is.

Recommended Answer

Ok, it seems I found out what it is. It is some kind of marker file, probably used for determining whether the S3 directory object exists or not. How did I reach this conclusion? First, I found this link, which shows the source of the

org.apache.hadoop.fs.s3native.NativeS3FileSystem#mkdir

method: http://apache-spark-user-list.1001560.n3.nabble.com/S3-Extra-folder-files-for-every-directory-node-td15078.html
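For illustration only, here is a rough Python sketch of the convention described in that thread. This is not the Hadoop source; the names below are made up to show the idea that "creating" a directory just writes a zero-byte <path>_$folder$ object, and the existence check looks for that same key.

# Illustrative sketch of the _$folder$ marker convention (not actual Hadoop code).
FOLDER_SUFFIX = "_$folder$"

def mkdirs(store, path):
    # "Creating" a directory writes a zero-byte marker object under <path>_$folder$.
    store[path.rstrip("/") + FOLDER_SUFFIX] = b""

def directory_exists(store, path):
    # The existence check looks for the marker key, not for data under the path.
    return path.rstrip("/") + FOLDER_SUFFIX in store

store = {}                                   # stands in for an S3 bucket
mkdirs(store, "dimensions/output")
print(directory_exists(store, "dimensions/output"))   # True, even with no data objects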

Then I searched other source repositories to see whether I would find a different version of the method. I didn't.

In the end, I ran an experiment: I reran the same Spark job after removing the S3 output directory object but leaving the output_$folder$ file in place. The job failed, saying that the output directory already exists.

My conclusion: this is Hadoop's way of knowing whether a directory with the given name exists in S3, and I will have to live with that.
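If the stray marker objects are a nuisance, they can be removed after the job finishes with a small cleanup script. Below is a minimal sketch assuming boto3 is available and default AWS credentials are configured; the bucket name comes from the example paths above and the prefix is a hypothetical one to scan. Note that deleting a marker while the real output still exists is harmless, but as the experiment above shows, the s3n filesystem relies on the marker when deciding whether a "directory" exists.

# Minimal cleanup sketch: delete stray "_$folder$" marker objects (assumes boto3).
import boto3

s3 = boto3.client("s3")
bucket = "prod.casumo.stu"      # bucket from the example paths above
prefix = "dimensions/"          # hypothetical prefix to scan

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        # NativeS3FileSystem writes zero-byte "<dir>_$folder$" marker objects.
        if obj["Key"].endswith("_$folder$"):
            s3.delete_object(Bucket=bucket, Key=obj["Key"])
            print("deleted", obj["Key"])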

All of the above happens when I run the job from my local dev machine, i.e. my laptop. If I run the same job from an AWS Data Pipeline, output_$folder$ does not get created.
