Spark dataframe checkpoint cleanup

Problem description

I have a dataframe in Spark into which an entire partition from Hive has been loaded, and I need to break the lineage in order to overwrite that same partition after making some modifications to the data. However, when the Spark job is done, I am left with the checkpoint data on HDFS. Why does Spark not clean this up by itself, or is there something I am missing?

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

spark.sparkContext.setCheckpointDir("/home/user/checkpoint/")
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

val df = spark.table("db.my_table").filter(col("partition") === 2)

// ... transformations to the dataframe

// checkpoint() materializes the dataframe and breaks the lineage
val checkpointDf = df.checkpoint()
checkpointDf.write.format("parquet").mode(SaveMode.Overwrite).insertInto("db.my_table")

After this I am left with this file on HDFS:

/home/user/checkpoint/214797f2-ce2e-4962-973d-8f215e5d5dd8/rdd-23/part-00000

And each time I run the Spark job, I get a new directory with a new unique ID, containing files for each RDD that was checkpointed from the dataframes.

Recommended answer

Spark has an implicit mechanism for cleaning up checkpoint files.

Add this property in spark-defaults.conf:

spark.cleaner.referenceTracking.cleanCheckpoints  true #Default is false
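
If you prefer not to edit spark-defaults.conf, the same property can also be set when the SparkSession is created, as long as it is set before the SparkContext starts. A minimal sketch, where the application name and checkpoint path are placeholders:

import org.apache.spark.sql.SparkSession

// Enable automatic cleanup of checkpoint files (default is false).
val spark = SparkSession.builder()
  .appName("checkpoint-cleanup-example")   // placeholder name
  .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
  .getOrCreate()

spark.sparkContext.setCheckpointDir("/home/user/checkpoint/")

Note that this cleanup is driven by Spark's ContextCleaner, which removes checkpoint files only once the checkpointed RDD has gone out of scope and been garbage collected on the driver, so the files may not disappear the moment the dataframe is last used.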

You can find more about Spark configuration on the Spark official configuration page.

If you want to remove the checkpoint directory yourself, you can delete it at the end of your script, for example with Python's rmtree for a local checkpoint directory (see the sketch below for a directory that lives on HDFS).
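
Since shutil.rmtree only works on a local filesystem path, one way to delete a checkpoint directory that lives on HDFS is the Hadoop FileSystem API, available from the same Spark job. A minimal sketch, assuming the checkpoint path from the question and the spark session already in scope:

import org.apache.hadoop.fs.Path

// Delete the checkpoint directory at the end of the job.
val checkpointPath = new Path("/home/user/checkpoint/")
val fs = checkpointPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
if (fs.exists(checkpointPath)) {
  fs.delete(checkpointPath, true)   // true = recursive, like rmtree
}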

Setting the property spark.cleaner.referenceTracking.cleanCheckpoints to true allows the cleaner to remove old checkpoint files inside the checkpoint directory.
