How to deal with executor memory and driver memory in Spark?

Problem description

I am confused about dealing with executor memory and driver memory in Spark.

My environment settings are as below:

  • 9 VMs with 128 GB memory and 16 CPUs each
  • CentOS
  • Hadoop 2.5.0-cdh5.2.0
  • Spark 1.1.0

Input data information:

  • 3.5 GB data file from HDFS

For simple development, I executed my Python code in standalone cluster mode (8 workers, 20 cores, 45.3 G memory) with spark-submit. Now I would like to set executor memory or driver memory for performance tuning.
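For reference, both settings can be passed directly on the spark-submit command line. A minimal sketch, assuming a standalone master URL, application script name, and memory values that are purely illustrative (not from the original post):

    spark-submit \
      --master spark://master-host:7077 \
      --executor-memory 4g \
      --driver-memory 2g \
      my_app.py

The same values can also be set through the spark.executor.memory and spark.driver.memory configuration properties, e.g. in spark-defaults.conf.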

From the Spark documentation, the definition for executor memory is

Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).

What about driver memory?

Recommended answer

The memory you need to assign to the driver depends on the job.

If the job is based purely on transformations and terminates in some distributed output action like rdd.saveAsTextFile, rdd.saveToCassandra, ... then the memory needs of the driver will be very low. A few hundred MB will do. The driver is also responsible for delivering files and collecting metrics, but it is not involved in data processing.
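As a minimal PySpark sketch of such a transformation-only job (the app name and the input/output paths are illustrative, not from the original post):

    from pyspark import SparkContext

    sc = SparkContext(appName="transform-only-job")  # illustrative name

    # Transformations only: nothing is pulled back to the driver.
    lines = sc.textFile("hdfs:///data/input.txt")    # illustrative path
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Distributed output action: each executor writes its own partitions,
    # so the driver only coordinates the job and needs little memory.
    counts.saveAsTextFile("hdfs:///data/wordcounts")  # illustrative path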

If the job requires the driver to participate in the computation, for example some ML algorithm that needs to materialize results and broadcast them on the next iteration, then your job becomes dependent on the amount of data passing through the driver. Operations like .collect, .take and .takeSample deliver data to the driver and hence, the driver needs enough memory to hold such data.
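A short sketch of that difference, assuming an existing SparkContext sc (the data is illustrative):

    # Assumes an existing SparkContext `sc`; the data is illustrative.
    rdd = sc.parallelize(range(1000000))

    few = rdd.take(10)                    # only 10 elements reach the driver
    sample = rdd.takeSample(False, 100)   # only 100 sampled elements reach the driver
    everything = rdd.collect()            # the entire RDD is copied into driver memory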

For example, if you have an RDD of 3 GB in the cluster and call val myresultArray = rdd.collect, then you will need 3 GB of memory in the driver to hold that data, plus some extra room for the functions mentioned in the first paragraph.
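Under that assumption, the --driver-memory passed to spark-submit would need to sit comfortably above the collected size; the figures and script name below are illustrative only:

    # Roughly 3 GB of collected data plus headroom for the driver itself
    spark-submit --driver-memory 5g --executor-memory 4g my_app.py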
