File already exists error writing new files from dataframe


Problem description

On EMR Spark, I am writing an RDD[String] to S3 via a DataFrame:

rddString
  .toDF()
  .coalesce(16)
  .write
  .option("compression", "gzip")
  .mode(SaveMode.Overwrite)
  .json(s"s3n://my-bucket/some/new/path")

The save mode is Overwrite, and s3n://my-bucket/some/new/path does not exist yet.

I consistently get an IOException: File already exists:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 55.0 failed 4 times, most recent failure: Lost task 15.3 in stage 55.0 (TID 8441, ip-172-31-17-30.us-west-2.compute.internal, executor 3): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:270)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:189)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:188)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: File already exists:s3n://my-bucket/some/new/path/part-00015-03a0c001-fc99-4055-9be5-68a1fb0cf6d3-c000.json.gz
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:625)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:932)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:913)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:810)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:176)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
    at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.<init>(JsonFileFormat.scala:140)
    at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anon$1.newInstance(JsonFileFormat.scala:80)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:303)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:312)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:254)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1371)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:259)
    ... 8 more

Spark v2.2.1, EMR v5.12.0

The files are written to the destination before the exception is thrown, but I don't know whether they are complete.

Answer

I ran into a similar problem running EMR with a Glue job. In short, this is usually not the real root cause of your job failure: the Spark task may have failed for some other reason first, and after the original failure has been retried, it eventually throws this "IOException: File already exists".

Find and fix the real root cause, and this error will go away as well.
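One way to surface that first failure (a debugging sketch of my own, not from the original answer): temporarily turn off task retries, so the job aborts with the real exception instead of the follow-up "File already exists" thrown by a retry attempt.

import org.apache.spark.sql.SparkSession

// Debugging sketch (an assumption, not part of the original answer): with
// spark.task.maxFailures set to 1, a failed task is not retried, so the
// job aborts with the first real exception. The default in cluster mode
// is 4, which is why the question's task "failed 4 times".
val spark = SparkSession.builder()
  .appName("debug-write")
  .config("spark.task.maxFailures", "1")
  .getOrCreate()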

In my case, the error reported in the CloudWatch error log was:

: org.apache.spark.SparkException: Job aborted.
at ...
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: File already exists:s3://personal-tests/xdqian/zappos_triplet_loss/output_cache_test/part-00003-8eaa7c78-e227-4476-b96d-4300e7350bc7-c000.csv

That alone didn't tell me much, but when I dug into the logs I found an exception like this:

18/12/05 06:14:15 ERROR Utils: Aborting task
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/root/appcache/application_1543990079218_0001/container_1543990079218_0001_01_000101/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/mnt/yarn/usercache/root/appcache/application_1543990079218_0001/container_1543990079218_0001_01_000101/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/root/appcache/application_1543990079218_0001/container_1543990079218_0001_01_000101/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/mnt/yarn/usercache/root/appcache/application_1543990079218_0001/container_1543990079218_0001_01_000001/GoldenGardensGluePythonScripts.zip/golden_gardens_glue_python_scripts/job.py", line 62, in <lambda>
TypeError: 'NoneType' object has no attribute '__getitem__'
After I fixed that NoneType error, the "File already exists" exception finally went away. I read in some other material (sorry, I can no longer trace it down) that the "File already exists" error is always caused by a task failing and being retried because of some other problem (the NoneType error in my case).

My understanding is that the executor task creates a file and writes the data out row by row. It may fail at, say, row 34 because of the NoneType error and be aborted, while the file is left behind holding the first 33 rows. A failed task is reportedly retried up to 4 times, and when a retry starts, it immediately finds the file left over from the previous attempt.

So the real root cause is buried further up in the logs, and the error log ends with the "File already exists" exception only because it is the last exception thrown before the job terminates. Overwrite mode does not help here, because it only performs a check at the very start of the job; it is not a control flag for this edge case.
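To make that mechanism concrete, here is a minimal sketch in the style of the question's Scala code (the input data and the _.trim bug are my own illustration, not from the original post): a row-level error pipelined into the write stage kills the task after its part file has been created, while validating the rows before the write avoids the retry loop entirely.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("safe-write").getOrCreate()
import spark.implicits._

// Hypothetical input: an RDD[String] containing a null element.
val rddString = spark.sparkContext.parallelize(
  Seq("""{"a": 1}""", null, """{"a": 2}"""))

// Buggy version: _.trim throws a NullPointerException on the null row.
// The map is pipelined into the write stage, so the task dies after its
// part file has already been created on S3, and each retry then trips
// over that leftover file:
// rddString.map(_.trim).toDF().write.json("s3n://my-bucket/some/new/path")

// Fixed version: drop the bad rows before the write stage, so no task
// aborts mid-file.
rddString
  .filter(_ != null)
  .map(_.trim)
  .toDF()
  .coalesce(16)
  .write
  .option("compression", "gzip")
  .mode(SaveMode.Overwrite)
  .json("s3n://my-bucket/some/new/path")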

