How to overwrite the output directory in Spark

Published: 2021/11/12 5:30:13. Tag: apache-spark


Question

I have a Spark Streaming application that produces a dataset every minute. I need to save the results of the processed data, overwriting the previous output.

When I try to overwrite the dataset, org.apache.hadoop.mapred.FileAlreadyExistsException is thrown and execution stops.

I set the Spark property with set("spark.files.overwrite", "true"), but it had no effect.

How can I overwrite or pre-delete the output files from Spark?

Recommended answer

UPDATE: I suggest using DataFrames, with something like ... .write.mode(SaveMode.Overwrite) ....
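As a minimal sketch of that suggestion (assuming Spark 2.x or later, a local master, and a hypothetical output path /tmp/overwrite-example, none of which come from the original answer), writing twice to the same directory with SaveMode.Overwrite should not raise FileAlreadyExistsException:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object OverwriteExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, purely for illustration.
    val spark = SparkSession.builder()
      .appName("overwrite-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical output location; replace with your own path.
    val outputPath = "/tmp/overwrite-example"

    // First batch: creates the directory.
    Seq(("a", 1), ("b", 2)).toDF("key", "value")
      .write.mode(SaveMode.Overwrite).parquet(outputPath)

    // Second batch: replaces the existing contents instead of failing.
    Seq(("c", 3)).toDF("key", "value")
      .write.mode(SaveMode.Overwrite).parquet(outputPath)

    // Read the directory back to inspect what the last write left behind.
    spark.read.parquet(outputPath).show()

    spark.stop()
  }
}
```

In a streaming job the same write call would typically sit inside the per-batch processing step, so each minute's result replaces the previous one.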

A handy "pimp my library" implicit class:

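import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}

// Adds a write(path) method to RDD[String] that overwrites existing output.
// Must be declared inside an object or class, with an implicit SparkSession in scope at the call site.
implicit class PimpedStringRDD(rdd: RDD[String]) {
  def write(p: String)(implicit ss: SparkSession): Unit = {
    import ss.implicits._
    rdd.toDF().as[String].write.mode(SaveMode.Overwrite).text(p)
  }
}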

For older versions, try:

// Disables the check that fails when the output directory already exists.
yourSparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(yourSparkConf)

In 1.1.0 you can also pass configuration settings to the spark-submit script with the --conf flag, for example --conf spark.hadoop.validateOutputSpecs=false.

WARNING (older versions): According to @piggybox, there is a bug in Spark where it only overwrites the files it needs in order to write its part- files; any other files are left in place.
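One way to avoid leftover files, and to do the pre-delete the question asks about, is to remove the output directory through the Hadoop FileSystem API before writing. The following is a sketch under stated assumptions rather than the answer's own method: saveReplacing is a hypothetical helper name, and it assumes the output path is a directory that this job alone owns.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object PreDelete {
  // Hypothetical helper: deletes `output` if it exists, then saves the RDD there.
  def saveReplacing(sc: SparkContext, rdd: RDD[String], output: String): Unit = {
    val path = new Path(output)
    // Resolve the file system (HDFS, local, ...) from the path and Spark's Hadoop config.
    val fs = path.getFileSystem(sc.hadoopConfiguration)
    if (fs.exists(path)) {
      // Second argument `true` requests a recursive delete of the directory.
      fs.delete(path, true)
    }
    rdd.saveAsTextFile(output)
  }
}
```

Because the whole directory is removed first, no stale part- files from a previous, larger run can survive. The trade-off is a short window in which the output does not exist, which may matter if other jobs read the directory concurrently.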
