Spark: Entire dataset concentrated in one executor
Question
I am running a Spark job with 3 files, each 100 MB in size. For some reason my Spark UI shows the entire dataset concentrated in 2 executors. This has made the job run for 19 hours and it is still running. Below is my Spark configuration. Spark 2.3 is the version used.
spark2-submit --class org.mySparkDriver \
--master yarn-cluster \
--deploy-mode cluster \
--driver-memory 8g \
--num-executors 100 \
--conf spark.default.parallelism=40 \
--conf spark.yarn.executor.memoryOverhead=6000mb \
--conf spark.dynamicAllocation.executorIdleTimeout=6000s \
--conf spark.executor.cores=3 \
--conf spark.executor.memory=8G
I tried repartitioning inside the code, which works: it makes the file go into 20 partitions (I used rdd.repartition(20)). But why should I have to repartition? I believe specifying spark.default.parallelism=40 in the script should let Spark divide the input file across 40 executors and process it there.
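One plausible explanation for the concentration (a sketch, not confirmed by the question): for file-based RDDs, the initial partition count typically comes from the Hadoop input splits, not from spark.default.parallelism (which mainly applies to RDDs produced by parallelize or by shuffles). The helper below is hypothetical, assuming a standard 128 MB block size and ignoring the split slop factor:

```python
import math

def input_partitions(file_sizes_mb, block_mb=128):
    """Hypothetical sketch: Hadoop-style input splitting yields roughly
    ceil(file_size / block_size) splits per file."""
    return sum(math.ceil(size / block_mb) for size in file_sizes_mb)

# Three 100 MB files, each smaller than one 128 MB block:
print(input_partitions([100, 100, 100]))  # -> 3 partitions, so only a few executors get tasks
```

Under this assumption, the job starts with only 3 partitions regardless of spark.default.parallelism, which is why an explicit rdd.repartition(20) spreads the work while the config setting alone does not.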
Can anyone help?

Thanks, Neethu
Answer
I am assuming you're running your jobs in YARN; if so, you can check the following properties.
yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.cpu-vcores
In YARN, these properties affect the number of containers that can be instantiated on a NodeManager, based on the spark.executor.cores and spark.executor.memory property values (along with the executor memory overhead).
For example, if a cluster has 10 nodes (RAM: 16 GB, cores: 6) and is set with the following YARN properties:
yarn.scheduler.maximum-allocation-mb=10GB
yarn.nodemanager.resource.memory-mb=10GB
yarn.scheduler.maximum-allocation-vcores=4
yarn.nodemanager.resource.cpu-vcores=4
Then with the Spark properties spark.executor.cores=2 and spark.executor.memory=4GB, you can expect 2 executors per node, so in total you'll get 19 executors + 1 container for the driver.
If the Spark properties are spark.executor.cores=3 and spark.executor.memory=8GB, then you will get 9 executors (only 1 executor per node) + 1 container for the driver.
You can refer to this link for more details.
Hope this helps.