How to delete an RDD in PySpark for the purpose of releasing resources?

Problem description

If I have an RDD that I no longer need, how do I delete it from memory? Would the following be enough to get this done:

del thisRDD

Thanks!

Recommended answer

No, del thisRDD is not enough: it only deletes the driver-side Python reference to the RDD. You should call thisRDD.unpersist() to remove the cached data.
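
A minimal sketch of the cleanup order this implies, assuming an existing SparkContext named sc (written in Python 3, whereas the quoted session below uses Python 2's xrange and print statement; the application name is hypothetical):

from pyspark import SparkContext

sc = SparkContext(appName="unpersist-demo")  # hypothetical application name

rdd = sc.parallelize(range(10), 2).cache()
rdd.count()      # an action materializes the cache on the executors

rdd.unpersist()  # frees the cached blocks on the executors
del rdd          # only now drop the driver-side Python reference;
                 # del alone would have left the cached blocks in place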

For your information, Spark uses a lazy evaluation model, which means that when you run this code:

>>> thisRDD = sc.parallelize(xrange(10),2).cache()

no data is actually cached yet; the RDD is only marked as 'to be cached' in its execution plan. You can check this as follows:

>>> print thisRDD.toDebugString()
(2) PythonRDD[6] at RDD at PythonRDD.scala:43 [Memory Serialized 1x Replicated]
 |  ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:364 [Memory Serialized 1x Replicated]
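
A quick driver-side check (a sketch continuing the same session; is_cached is a public attribute of PySpark's RDD) shows that the flag is set as soon as cache() is called, even though nothing has been materialized yet:

>>> thisRDD.is_cached
True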

But after you call an action on this RDD at least once, the data actually becomes cached:

>>> thisRDD.count()
10
>>> print thisRDD.toDebugString()
(2) PythonRDD[6] at RDD at PythonRDD.scala:43 [Memory Serialized 1x Replicated]
 |       CachedPartitions: 2; MemorySize: 174.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B
 |  ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:364 [Memory Serialized 1x Replicated]
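
And a hedged sketch of the cleanup, continuing the same session: unpersist() drops the cached partitions (and returns the RDD itself, hence the repr line below), while the RDD remains usable and is simply recomputed on the next action:

>>> thisRDD.unpersist()
PythonRDD[6] at RDD at PythonRDD.scala:43
>>> thisRDD.is_cached
False
>>> thisRDD.count()
10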

You can easily check the persisted data and its persistence level in the Spark UI at http://<driver_node>:4040/storage. There you would see that del thisRDD does not change the persistence of this RDD, whereas thisRDD.unpersist() removes it from storage; you can still use thisRDD in your code afterwards, but it is no longer kept in memory and is recomputed each time it is queried.
