Increase the Spark workers' cores
Problem description
I have installed Spark on a master and 2 workers. Each worker originally has 8 cores. When I start the master, the workers run properly without any problem, but in the Spark GUI each worker has only 2 cores assigned.
How can I increase the number of cores so that each worker works with all 8 cores?
Recommended answer
The setting which controls cores per executor is spark.executor.cores (see the docs). It can be set either via a spark-submit command-line argument or in spark-defaults.conf. That file is usually located in /etc/spark/conf (YMMV); you can search for it with find / -type f -name spark-defaults.conf:
spark.executor.cores 8
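Alternatively, the same setting can be passed on the command line. A minimal sketch, where my_app.py is just a placeholder for your own application:

```shell
# Request 8 cores per executor via the dedicated flag
# (my_app.py stands in for your actual application script):
spark-submit --executor-cores 8 my_app.py

# or via the generic --conf form, which works for any Spark property:
spark-submit --conf spark.executor.cores=8 my_app.py
```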
However, the setting does not guarantee that each executor will always get all the available cores; this depends on your workload.
If you schedule tasks on a DataFrame or RDD, Spark will run one parallel task for each partition of the DataFrame. A task is scheduled to an executor (a separate JVM), and the executor can run multiple tasks in parallel, in JVM threads, one per core.
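As a rough illustration of that scheduling model (plain arithmetic, not a Spark API, and a simplification that ignores data locality and task skew): the cluster offers instances × cores parallel task slots, so a stage with more partitions than slots runs in several "waves".

```python
import math

def task_waves(num_partitions: int, executor_instances: int, executor_cores: int) -> int:
    """Estimate how many scheduling waves a stage needs.

    Simplified model: each executor runs up to executor_cores tasks
    at once, so the cluster has instances * cores parallel slots.
    """
    slots = executor_instances * executor_cores
    return math.ceil(num_partitions / slots)

# With 2 executors of 7 cores each (14 slots), 14 partitions finish
# in a single wave, while 15 partitions already need a second wave.
print(task_waves(14, 2, 7))  # 1
print(task_waves(15, 2, 7))  # 2
```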
Also, an executor does not necessarily run on a separate worker: if there is enough memory, two executors can share a worker node.
In order to use all the cores, the setup in your case could look as follows, given you have 10 GB of memory on each node:
spark.default.parallelism 14
spark.executor.instances 2
spark.executor.cores 7
spark.executor.memory 9g
Setting the memory to 9g makes sure that each executor is assigned to a separate node. Each executor will then have 7 cores available, and each DataFrame operation will be scheduled as 14 concurrent tasks, distributed 7 to each executor. You can also repartition a DataFrame instead of setting default.parallelism. One core and 1 GB of memory are left for the operating system.
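The numbers above can be sanity-checked with simple arithmetic, assuming (as in the question) 2 worker nodes with 8 cores and 10 GB of memory each:

```python
# Per-node resources from the question, and the proposed settings.
node_cores, node_mem_gb = 8, 10
executor_cores, executor_mem_gb = 7, 9
executor_instances = 2  # one executor per worker node

# A 9 GB executor cannot share a 10 GB node with another executor,
# so each node hosts exactly one executor, leaving 1 core and 1 GB
# of memory for the operating system.
cores_left_for_os = node_cores - executor_cores        # 1
mem_left_for_os_gb = node_mem_gb - executor_mem_gb     # 1

# Total parallel task slots, matching spark.default.parallelism 14.
parallelism = executor_instances * executor_cores      # 14
print(cores_left_for_os, mem_left_for_os_gb, parallelism)  # 1 1 14
```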