Spark FileAlreadyExistsException on Stage Failure

This article discusses how to handle a Spark FileAlreadyExistsException on stage failure; it may serve as a useful reference for anyone hitting the same problem.

Problem Description

I am trying to write a dataframe to an S3 location after re-partitioning. But whenever the write stage fails and Spark retries the stage, it throws a FileAlreadyExistsException.

When I re-submit the job, it works fine if Spark completes the stage in one attempt.

Below is my code snippet:

df.repartition(<some-value>).write.format("orc").option("compression", "zlib").mode("Overwrite").save(path)

I believe Spark should remove the files from the failed stage before retrying. I understand this would be avoided if we set retries to zero, but stage failures are expected from time to time, so that would not be a proper solution.
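
For context only: the "retries to zero" idea above corresponds to Spark's spark.task.maxFailures setting, which defaults to 4 and matches the "failed 4 times" in the error below. A minimal, illustrative Scala sketch of disabling task retries is shown here; the app name is a placeholder, and, as stated above, this is not considered a proper fix.

import org.apache.spark.sql.SparkSession

// Illustrative only: allow a single task attempt, i.e. no retries (default is 4).
// Must be set before the SparkSession/SparkContext is created.
val sparkNoRetry = SparkSession.builder()
  .appName("no-task-retries")              // placeholder app name
  .config("spark.task.maxFailures", "1")
  .getOrCreate()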

Below is the error:

Job aborted due to stage failure: Task 0 in stage 6.1 failed 4 times, most recent failure: Lost task 0.3 in stage 6.1 (TID 740, ip-address, executor 170): org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://<bucket-name>/<path-to-object>/part-00000-c3c40a57-7a50-41da-9ce2-555753cab63a-c000.zlib.orc
    at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
    at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
    at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:601)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:932)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:913)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:242)
    at org.apache.orc.impl.PhysicalFsWriter.<init>(PhysicalFsWriter.java:95)
    at org.apache.orc.impl.WriterImpl.<init>(WriterImpl.java:170)
    at org.apache.orc.OrcFile.createWriter(OrcFile.java:843)
    at org.apache.orc.mapreduce.OrcOutputFormat.getRecordWriter(OrcOutputFormat.java:50)
    at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:43)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:121)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:233)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:

I am using Spark 2.4 on EMR. Please suggest a solution.

Edit 1: Please note the issue is not related to overwrite mode; I am already using it. As the question title suggests, the issue is with files left over when a stage fails. Maybe the Spark UI clears it.

Recommended Answer

Set spark.hadoop.orc.overwrite.output.file=true in your Spark config.

You can find more details on this config in OrcConf.java.
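
To illustrate where the setting goes, here is a minimal spark-shell-style sketch, assuming a DataFrame and write shaped like the one in the question; the app name, partition count, and S3 paths are placeholders, not values from the original job. The spark.hadoop. prefix forwards the property to the Hadoop configuration as orc.overwrite.output.file, so a retried task attempt is allowed to overwrite the part file left behind by the failed attempt.

import org.apache.spark.sql.SparkSession

// Set the flag before the session is created so it reaches the Hadoop/ORC configuration.
val spark = SparkSession.builder()
  .appName("orc-overwrite-on-retry")                         // placeholder app name
  .config("spark.hadoop.orc.overwrite.output.file", "true")  // forwarded as orc.overwrite.output.file
  .getOrCreate()

val df = spark.read.orc("s3://<bucket-name>/<input-path>")   // placeholder source
df.repartition(200)                                          // placeholder partition count
  .write
  .format("orc")
  .option("compression", "zlib")
  .mode("overwrite")
  .save("s3://<bucket-name>/<output-path>")                  // placeholder destination

The same property can equally be passed on the command line, e.g. spark-submit --conf spark.hadoop.orc.overwrite.output.file=true.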

This concludes the article on Spark FileAlreadyExistsException on stage failure; hopefully the recommended answer above is helpful.
