Spark FileAlreadyExistsException on stage failure while writing a JSON file

Question

I am trying to write a dataframe to an S3 location in JSON format. But whenever an executor task fails and Spark retries the stage, it throws a FileAlreadyExistsException.

A similar question has been asked before, but it deals with ORC files and a separate Spark conf, and doesn't address my issue.

Here's my code:

// The query OOMs an executor mid-stage, which triggers task/stage retries
val result = spark.sql(query_that_OOMs_executor)
// Overwrite mode is set, yet retried tasks still hit FileAlreadyExistsException
result.write.mode(SaveMode.Overwrite).json(s3_path)

From the Spark UI, the error on the executor says

ExecutorLostFailure (executor 302 exited caused by one of the running tasks) 
Reason: Container killed by YARN for exceeding memory limits. 4.5 GB of 4.5 GB physical memory used. 
Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.

But the driver stack trace says

Job aborted due to stage failure: Task 1344 in stage 2.0 failed 4 times, most recent failure: Lost task 1344.3 in stage 2.0 (TID 25797, executor.ec2.com, executor 217): org.apache.hadoop.fs.FileAlreadyExistsException: s3://prod-bucket/application_1590774027047/-650323473_1594243391573/part-01344-dc971661-93ef-4abc-8380-c000.json already exists

How do I make it so that Spark overwrites this JSON file on retry? That way I'd get the real failure reason on the driver once all 4 retries fail. I've already set the mode to overwrite, so that's not helping.

Answer

This happened because of a fundamental problem with the DirectFileOutputCommitter, which was being used here by default.

There are two things going on here: the executor OOM, and then the FileAlreadyExistsException on the retries, which causes the retries (and hence the SQL query) to fail.

Reason: the DirectFileOutputCommitter tries to write a task attempt's output files to the final output path: it writes to a staging directory, then renames to the final path and deletes the original. This is fragile, prone to inconsistencies and errors, and is also not recommended by Spark.
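
To see why that collides on retry, here is a minimal hypothetical sketch (not the committer's actual code) of a task attempt creating its part file at the final path, where it can find a file left behind by a previous failed attempt:

object DirectWriteSketch {
  import org.apache.hadoop.fs.{FileSystem, Path}

  // Hypothetical illustration only: the part file is created directly
  // at the final output path during the task attempt.
  def writePartDirectly(fs: FileSystem, finalPart: Path): Unit = {
    // With overwrite = false, create() throws FileAlreadyExistsException
    // if an earlier failed attempt already left this file behind, which
    // is the error shown in the driver stack trace above.
    val out = fs.create(finalPart, false)
    try out.writeBytes("""{"key": "value"}""" + "\n")
    finally out.close()
  }
}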

Instead, I used the Netflix S3 committer, which does this in a multipart fashion. It first writes the files to local disk. During task commit, each of these is uploaded to S3 as a multipart upload but is not yet made visible. During job commit (which happens only once all tasks have completed successfully, so it is a safe operation), the uploads are completed and the local disk data is deleted; only now does the data become visible on S3. This prevents failed tasks from writing directly to S3 and hence avoids the FileAlreadyExistsException on retry.
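
As a rough sketch of that two-phase protocol (all names here are hypothetical placeholders, not the Netflix committer's actual API):

object MultipartCommitSketch {
  import java.nio.file.{Files, Paths}
  import java.nio.charset.StandardCharsets

  // A part file written locally, plus the S3 key it should end up at.
  case class PendingUpload(localFile: String, s3Key: String)

  // Task attempt: write only to local disk; nothing touches S3 yet,
  // so a failed attempt leaves nothing behind to collide with.
  def writeTaskOutput(lines: Seq[String], localFile: String): Unit =
    Files.write(Paths.get(localFile),
      lines.mkString("\n").getBytes(StandardCharsets.UTF_8))

  // Task commit: start a multipart upload and upload the parts. The
  // object stays invisible on S3 until the upload is completed.
  def commitTask(localFile: String, s3Key: String): PendingUpload = {
    // initiate-multipart-upload and upload-part calls would go here
    PendingUpload(localFile, s3Key)
  }

  // Job commit: runs only after every task succeeded, so completing
  // the uploads here is safe; only now does the data appear on S3.
  def commitJob(pending: Seq[PendingUpload]): Unit =
    pending.foreach { p =>
      // complete-multipart-upload call would go here
      Files.deleteIfExists(Paths.get(p.localFile)) // clean up local disk
    }
}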

As for the executor OOMs: they still happen for my query, but the retries now succeed, whereas with the DirectFileOutputCommitter the retries were failing too.

To solve this, I basically did:

set spark.sql.sources.outputCommitterClass=com.netflix.s3.S3DirectoryOutputCommitter;
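
For reference, the equivalent setting from code might look like the following. This is a sketch: it assumes the Netflix s3committer jar is already on the classpath and reuses the committer class name from the line above; spark.sql.sources.outputCommitterClass is the Spark SQL conf being set there.

import org.apache.spark.sql.SparkSession

// Sketch: set the output committer when building the session, before
// any writes run. The app name is arbitrary.
val spark = SparkSession.builder()
  .appName("write-json-with-s3-committer")
  .config("spark.sql.sources.outputCommitterClass",
    "com.netflix.s3.S3DirectoryOutputCommitter")
  .getOrCreate()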
