Why does SparkContext.parallelize use memory of the driver?

Question

I need to create a parallelized collection with sc.parallelize() in pyspark (Spark 2.1.0).

The collection in my driver program is big. When I parallelize it, it takes up a lot of memory on the master node.

The collection still seems to be held in Spark's memory on the master node after I parallelize it to the worker nodes. Here's an example of my code:

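# my python code
from pyspark import SparkContext

sc = SparkContext()
a = [1.0] * 1000000000               # one billion floats, built on the driver
rdd_a = sc.parallelize(a, 1000000)   # split into one million partitions
total = rdd_a.reduce(lambda x, y: x + y)  # named total so the builtin sum is not shadowed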

I've tried

del a

to destroy it, but it didn't work. The Spark Java process is still using a lot of memory.

After I create rdd_a, how can I destroy a to free the master node's memory?

Thanks!

Solution

The master's job is to coordinate the workers and to hand a worker a new task as soon as it finishes its current one. To do that, the master needs to keep track of every task that has to be done for a given calculation.

Now, if the input were a file, a task would simply look like "read file F from X to Y". But because the input was in memory to begin with, each task has to carry its own slice of the data: 1,000,000,000 elements split into 1,000,000 partitions is 1,000 numbers per task. And since the master needs to keep track of all 1,000,000 tasks, that gets quite large.
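
To make the contrast concrete, here is a rough sketch in plain Python (no Spark needed). It is not how Spark actually represents or serializes tasks. The tuple file_task and the list memory_task are hypothetical stand-ins, and pickle is used only as a convenient way to measure how much each kind of task description has to carry.

```python
import pickle

TOTAL_ELEMENTS = 1_000_000_000   # len(a) in the question
NUM_PARTITIONS = 1_000_000       # second argument to sc.parallelize

# Each partition (and so each task) gets this many numbers.
elements_per_task = TOTAL_ELEMENTS // NUM_PARTITIONS

# Hypothetical file-backed task: "read file F from X to Y".
# Only a name and two offsets are needed, whatever the range size.
file_task = ("F", 0, 8 * elements_per_task)

# Hypothetical in-memory task: it has to carry its slice of the data.
memory_task = [1.0] * elements_per_task

file_task_bytes = len(pickle.dumps(file_task))
memory_task_bytes = len(pickle.dumps(memory_task))

print("numbers per task:", elements_per_task)
print("file-style task description, bytes:", file_task_bytes)
print("in-memory task description, bytes:", memory_task_bytes)
print("in-memory total over all tasks, bytes:",
      memory_task_bytes * NUM_PARTITIONS)
```

The file-style description stays the same size however large the range is, while the in-memory one grows with the slice it carries. Multiplied by 1,000,000 tasks, that is the bookkeeping the answer refers to.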
