PySpark: fully cleaning checkpoints


Problem description

According to the documentation, it is possible to tell Spark to keep track of "out of scope" checkpoints - those that are no longer needed - and clean them from disk.

SparkSession.builder
  ...
  .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
  .getOrCreate()
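
For context, a minimal PySpark sketch of the setup being described (the application name, checkpoint path and RDD are illustrative, not from the original question):

from pyspark.sql import SparkSession

# Ask Spark to clean checkpoint files whose RDDs go out of scope.
spark = (SparkSession.builder
         .appName("checkpoint-cleanup-demo")
         .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
         .getOrCreate())
sc = spark.sparkContext

# Checkpoints are written to a random, application-specific subfolder of this directory.
sc.setCheckpointDir("hdfs:///tmp/checkpoint")

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
rdd.checkpoint()
rdd.count()  # materializes the RDD and writes the checkpoint files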

Apparently it does, but the problem is that the last checkpointed RDDs are never deleted.

  • Is there any configuration I am missing to perform a full cleanup?
  • If there isn't: is there any way to get the name of the temporary folder created for a particular application so I can programmatically delete it? I.e. get 0c514fb8-498c-4455-b147-aff242bd7381 from SparkContext the same way you can get the applicationId.

Recommended answer

I know it's an old question, but I was recently exploring checkpoints and ran into similar problems, so I'd like to share my findings.

Question: Is there any configuration I am missing to perform a full cleanup?

Setting spark.cleaner.referenceTracking.cleanCheckpoints=true works some of the time, but it is hard to rely on. The official documentation says that by setting this property Spark will

clean checkpoint files if the reference is out of scope

I don't know exactly what that means, because my understanding is that once the Spark session/context is stopped, the checkpoints should be cleaned up.
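
One way to read that line is that the ContextCleaner only removes a checkpoint's files after the corresponding RDD reference has been garbage collected on the driver, not when the application stops. Below is a hedged sketch of that interpretation in PySpark; whether the files actually disappear depends on GC timing, so it is not a reliable cleanup mechanism, which matches the behaviour described above:

import gc

rdd = sc.parallelize(range(1000))
rdd.checkpoint()
rdd.count()          # materializes the RDD and writes the checkpoint files

del rdd              # drop the last Python reference so the RDD can go "out of scope"
gc.collect()         # Python-side GC releases the wrapped JVM RDD
sc._jvm.System.gc()  # nudge the JVM GC so the ContextCleaner gets a chance to run

# Even with spark.cleaner.referenceTracking.cleanCheckpoints=true, the last
# checkpointed RDD's files may still be left on disk, as noted above.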

However, I did find an answer to your second question:

If there isn't: is there any way to get the name of the temporary folder created for a particular application so I can programmatically delete it? I.e. get 0c514fb8-498c-4455-b147-aff242bd7381 from SparkContext the same way you can get the applicationId.

Yes, we can get the checkpointed directory as shown below:

Scala:

// Set the checkpoint directory
scala> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")

scala> spark.sparkContext.getCheckpointDir.get
res3: String = hdfs://<name-node:port>/tmp/checkpoint/625034b3-c6f1-4ab2-9524-e48dfde589c3

// getCheckpointDir returns a String, so we can use org.apache.hadoop.fs to delete the path

PySpark:

# Set the checkpoint directory
>>> spark.sparkContext.setCheckpointDir('hdfs:///tmp/checkpoint')
>>> t = sc._jsc.sc().getCheckpointDir().get()
>>> t
u'hdfs://<name-node:port>/tmp/checkpoint/dc99b595-f8fa-4a08-a109-23643e2325ca'

# Note the leading 'u': the path is returned as a unicode object (Python 2).
# Below are the steps to get the Hadoop FileSystem object and delete the path.

>>> fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True

>>> fs.delete(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True
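
Building on the calls above, the cleanup can be wrapped in a small helper and run just before stopping the session. This is only a sketch (the helper name is mine); it uses the recursive delete(path, True) overload so the whole application-specific subfolder is removed:

def remove_checkpoint_dir(spark):
    """Delete this application's checkpoint subfolder, if one was set."""
    sc = spark.sparkContext
    checkpoint_dir = sc._jsc.sc().getCheckpointDir()  # Scala Option[String]
    if checkpoint_dir.isDefined():
        path = sc._jvm.org.apache.hadoop.fs.Path(checkpoint_dir.get())
        fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
        fs.delete(path, True)  # True = recursive

remove_checkpoint_dir(spark)
spark.stop()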

