Manually calling Spark's garbage collection from PySpark

Problem description

I have been running a workflow on about 3 million records x 15 columns, all strings, on my 4-core, 16 GB machine using PySpark 1.5 in local mode. I have noticed that if I run the same workflow again without first restarting Spark, memory runs out and I get out-of-memory exceptions.

Since all my caches sum up to about 1 GB, I thought the problem lay in garbage collection. I was able to run the Python garbage collector manually by calling:

import gc
collected = gc.collect()
print "Garbage collector: collected %d objects." % collected

This helped somewhat.

I have played with the settings of Spark's GC according to this article, and have tried compressing the RDD and changing the serializer to Kryo. This slowed down the processing and did not help much with the memory.
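
As a point of reference, that kind of tuning can be sketched with the PySpark 1.x API roughly as follows; the master URL, app name, and example RDD here are placeholders, not the asker's actual workflow:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[4]")            # placeholder: local mode with 4 cores
        .setAppName("gc-tuning-sketch")   # placeholder app name
        # Use Kryo instead of the default Java serializer.
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Compress serialized RDD partitions, trading CPU for memory.
        .set("spark.rdd.compress", "true"))
sc = SparkContext(conf=conf)

# An example RDD cached under the configuration above.
rdd = sc.parallelize(range(1000)).cache()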

Since I know exactly when I have spare CPU cycles to call the GC, it could help my situation to know how to call it manually in the JVM.

Recommended answer

I believe this will trigger a GC (hint) in the JVM:

spark.sparkContext._jvm.System.gc()
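
Note that spark here is a SparkSession (Spark 2.0+); with a plain SparkContext, as in PySpark 1.5, the equivalent call is sc._jvm.System.gc(). A minimal sketch combining the Python-side and JVM-side collections, with an illustrative helper name:

import gc

def force_gc(sc):
    # Collect unreachable Python objects on the driver first ...
    collected = gc.collect()
    # ... then hint the driver JVM to run its garbage collector.
    sc._jvm.System.gc()
    return collected

Keep in mind that System.gc() is only a request the JVM may ignore, and through _jvm it only reaches the driver's JVM; in local mode that is the whole process, but on a cluster the executors are not affected.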

See also: How to force garbage collection in Java?

And: Java: How do you really force a GC using JVMTI's ForceGarbageCollection?
