Spark FileAlreadyExistsException on stage failure while writing a JSON file


Problem Description

I am trying to write a dataframe to an s3 location in JSON format. But whenever an executor task fails and Spark retries the stage it throws a FileAlreadyExistsException.

A similar question has been asked before but it addresses ORC files with a separate spark conf and doesn't address my issue.

Here is my code:

val result = spark.sql(query_that_OOMs_executor)
result.write.mode(SaveMode.Overwrite).json(s3_path)

From the Spark UI, the error on the executor says

ExecutorLostFailure (executor 302 exited caused by one of the running tasks) 
Reason: Container killed by YARN for exceeding memory limits. 4.5 GB of 4.5 GB physical memory used. 
Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
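
For reference, spark.yarn.executor.memoryOverhead mentioned in that message is an application-start setting (spark-submit --conf or SparkSession config), not something that can be changed mid-job. A minimal sketch of bumping it, with an assumed value of 1024 MB and a hypothetical app name, both of which would need tuning for the real workload:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("json-write-job")                              // hypothetical app name
  .config("spark.yarn.executor.memoryOverhead", "1024")   // assumed value, in MB
  .getOrCreate()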

But the driver stack trace says

Job aborted due to stage failure: Task 1344 in stage 2.0 failed 4 times, most recent failure: Lost task 1344.3 in stage 2.0 (TID 25797, executor.ec2.com, executor 217): org.apache.hadoop.fs.FileAlreadyExistsException: s3://prod-bucket/application_1590774027047/-650323473_1594243391573/part-01344-dc971661-93ef-4abc-8380-c000.json already exists

How do I make Spark overwrite this JSON file on retry? That way, once all 4 retries fail, I'll see the real failure reason on the driver. I've already set the save mode to Overwrite, so that isn't helping.

Recommended Answer

This issue happened because of a fundamental problem with the DirectFileOutputCommitter, which was being used here by default.

There are two things here: the executor OOM, and then the FileAlreadyExistsException on retry, which causes the retries (and hence the SQL query) to fail.

Reason: the DirectFileOutputCommitter writes each task attempt's output files straight to the final output path, rather than writing to a temporary/staging location and only renaming to the final path on commit. So when a task attempt dies partway through (the OOM here), its part file is already sitting at the final path, and the retried attempt hits FileAlreadyExistsException when it tries to create the same file. This is prone to inconsistencies and errors, and such direct committers are not recommended by Spark.
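
A minimal sketch of that failure mode (not the committer's actual code, and the path is hypothetical): the first, failed attempt leaves a part file at the final location, so a retried attempt creating the same path without overwrite fails.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val out = new Path("s3://some-bucket/output/part-00000.json")  // hypothetical part file
val fs  = out.getFileSystem(new Configuration())

// The failed attempt already created this object at the final path, so the retry's
// create() with overwrite = false throws org.apache.hadoop.fs.FileAlreadyExistsException.
val stream = fs.create(out, false)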

Instead, I used the Netflix S3 committer, which does this in a multipart fashion. It first writes the files to local disk; then, during task commit, each file is uploaded to S3 as a multipart upload but is not made visible yet; then, during job commit (which happens only after all tasks have completed successfully, so it is a safe operation), the local disk data is deleted and the multipart uploads are completed, at which point the data becomes visible on S3. This prevents failed tasks from writing anything directly to S3 and hence avoids the FileAlreadyExistsException on retry.

Now for the executor OOMs: they still happen for my query, but the retries succeed, whereas before they were also failing because of the DirectFileOutputCommitter.

To solve this, I basically did

set spark.sql.sources.outputCommitterClass=com.netflix.s3.S3DirectoryOutputCommitter;
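
The same setting can also be applied from Scala before the write runs; a minimal sketch, assuming the Netflix committer classes are already on the classpath (the class name is the one from the SET statement above):

spark.conf.set(
  "spark.sql.sources.outputCommitterClass",
  "com.netflix.s3.S3DirectoryOutputCommitter")

val result = spark.sql(query_that_OOMs_executor)
result.write.mode(SaveMode.Overwrite).json(s3_path)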
