Junk Spark output file on S3 with dollar signs

Problem Description

I have a simple Spark job that reads a file from S3, takes five records, and writes them back to S3. What I see is that there is always an additional file in S3, next to my output "directory", called output_$folder$.

What is it? How can I prevent Spark from creating it? Here is some code to show what I am doing...

x = spark.sparkContext.textFile("s3n://.../0000_part_00")        # read the input file from S3
five = x.take(5)                                                  # collect the first five records to the driver
five = spark.sparkContext.parallelize(five)                       # turn them back into an RDD
five.repartition(1).saveAsTextFile("s3n://prod.casumo.stu/dimensions/output/")  # write a single part file back to S3

After the job I have an S3 "directory" called output which contains the results, and another S3 object called output_$folder$ which I don't know what it is.

Recommended Answer

Ok, it seems I found out what it is. It is some kind of marker file, probably used for determining whether the S3 directory object exists or not. How did I reach this conclusion? First, I found this link, which shows the source of the

org.apache.hadoop.fs.s3native.NativeS3FileSystem#mkdir

method: http://apache-spark-user-list.1001560.n3.nabble.com/S3-Extra-folder-files-for-every-directory-node-td15078.html
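For illustration only, here is a rough Python sketch of the convention described in that thread. This is not the Hadoop source; the names below are made up to show the idea that "creating" a directory just writes a zero-byte <path>_$folder$ object, and the existence check looks for that same key.

# Illustrative sketch of the _$folder$ marker convention (not actual Hadoop code).
FOLDER_SUFFIX = "_$folder$"

def mkdirs(store, path):
    # "Creating" a directory writes a zero-byte marker object under <path>_$folder$.
    store[path.rstrip("/") + FOLDER_SUFFIX] = b""

def directory_exists(store, path):
    # The existence check looks for the marker key, not for data under the path.
    return path.rstrip("/") + FOLDER_SUFFIX in store

store = {}                                   # stands in for an S3 bucket
mkdirs(store, "dimensions/output")
print(directory_exists(store, "dimensions/output"))   # True, even with no data objects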

Then I searched other source repositories to see whether I would find a different version of the method. I didn't.

In the end, I ran an experiment: I reran the same Spark job after removing the S3 output directory object but leaving the output_$folder$ file in place. The job failed, saying that the output directory already exists.

My conclusion: this is Hadoop's way of knowing whether a directory with the given name exists in S3, and I will have to live with that.
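If the stray marker objects are a nuisance, they can be removed after the job finishes with a small cleanup script. Below is a minimal sketch assuming boto3 is available and default AWS credentials are configured; the bucket name comes from the example paths above and the prefix is a hypothetical one to scan. Note that deleting a marker while the real output still exists is harmless, but as the experiment above shows, the s3n filesystem relies on the marker when deciding whether a "directory" exists.

# Minimal cleanup sketch: delete stray "_$folder$" marker objects (assumes boto3).
import boto3

s3 = boto3.client("s3")
bucket = "prod.casumo.stu"      # bucket from the example paths above
prefix = "dimensions/"          # hypothetical prefix to scan

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        # NativeS3FileSystem writes zero-byte "<dir>_$folder$" marker objects.
        if obj["Key"].endswith("_$folder$"):
            s3.delete_object(Bucket=bucket, Key=obj["Key"])
            print("deleted", obj["Key"])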

All of the above happens when I run the job from my local dev machine, i.e. my laptop. If I run the same job from an AWS Data Pipeline, output_$folder$ does not get created.
