Apache Spark does not delete temporary directories


Problem description

After a Spark program completes, three temporary directories remain in the temp directory. The directory names are like this: spark-2e389487-40cc-4a82-a5c7-353c0feefbb7

The directories are empty.

And when the Spark program runs on Windows, a snappy DLL file also remains in the temp directory. The file name is like this: snappy-1.0.4.1-6e117df4-97b6-4d69-bf9d-71c4a627940c-snappyjava

They are created every time the Spark program runs. So the number of files and directories keeps growing.

How can they be deleted?

Spark version is 1.3.1 with Hadoop 2.6.

Update

I've traced the spark source code.

The module methods that create the 3 'temp' directories are as follows:

  • DiskBlockManager.createLocalDirs
  • HttpFileServer.initialize
  • SparkEnv.sparkFilesDir

They (eventually) call Utils.getOrCreateLocalRootDirs and then Utils.createDirectory, which intentionally does NOT mark the directory for automatic deletion.

The comment of createDirectory method says: "The directory is guaranteed to be newly created, and is not marked for automatic deletion."
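
For reference, the behaviour described above boils down to something like the following simplified sketch (illustrative only, not the actual Spark 1.3.1 source; the parameter names are assumed): a uniquely named directory is created, but nothing registers it for deletion when the JVM exits.

```scala
import java.io.File
import java.util.UUID

// Simplified sketch of the createDirectory logic described above
// (illustrative only; not the real Spark source).
def createDirectory(root: String, namePrefix: String = "spark"): File = {
  var dir: File = null
  var attempts = 0
  while (dir == null && attempts < 10) {
    attempts += 1
    val candidate = new File(root, s"$namePrefix-${UUID.randomUUID()}")
    if (candidate.mkdirs()) {
      dir = candidate // created, but NOT marked for automatic deletion
    }
  }
  if (dir == null) {
    throw new java.io.IOException(s"Failed to create a temp directory under $root")
  }
  dir
}
```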

I don't know why they are not marked. Is this really intentional?

Answer

Three SPARK_WORKER_OPTS exist to support worker application folder cleanup, copied here for further reference from the Spark documentation:

  • spark.worker.cleanup.enabled, default value is false. Enables periodic cleanup of worker/application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.

  • spark.worker.cleanup.interval, default is 1800, i.e. 30 minutes. Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.

  • spark.worker.cleanup.appDataTtl, default is 7*24*3600 (7 days). The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
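
In standalone mode these properties are passed to the worker daemons, typically via SPARK_WORKER_OPTS in conf/spark-env.sh. A minimal sketch, assuming that mechanism (the interval and TTL values are just the documented defaults; 604800 seconds is 7 days):

```
# conf/spark-env.sh on each standalone worker (sketch; adjust values as needed)
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
 -Dspark.worker.cleanup.interval=1800 \
 -Dspark.worker.cleanup.appDataTtl=604800"
```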
