Spark: Entire dataset concentrated in one executor


Problem Description

I am running a Spark job with 3 files, each 100 MB in size, and for some reason the Spark UI shows the entire dataset concentrated in 2 executors. This has kept the job running for 19 hours and counting. Below is my Spark configuration; Spark 2.3 is the version used.

spark2-submit --class org.mySparkDriver \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 8g \
    --num-executors 100 \
    --conf spark.default.parallelism=40 \
    --conf spark.yarn.executor.memoryOverhead=6000mb \
    --conf spark.dynamicAllocation.executorIdleTimeout=6000s \
    --conf spark.executor.cores=3 \
    --conf spark.executor.memory=8G \

I tried repartitioning inside the code, which works: it spreads the file across 20 partitions (I used rdd.repartition(20)). But why should I have to repartition? I believed that specifying spark.default.parallelism=40 in the script would let Spark divide the input file into 40 partitions and process it across 40 executors.
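For reference, a minimal sketch of the repartitioning workaround described above (Scala; the input path is a placeholder):

import org.apache.spark.sql.SparkSession

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RepartitionSketch").getOrCreate()
    val sc = spark.sparkContext

    // For file-based RDDs the initial partition count comes from the Hadoop
    // input splits (roughly one per HDFS block), not from
    // spark.default.parallelism, which applies to shuffles and parallelize().
    val rdd = sc.textFile("/path/to/input")  // placeholder path
    println(s"partitions after read: ${rdd.getNumPartitions}")

    // Full shuffle into 20 partitions, as described in the question.
    val repartitioned = rdd.repartition(20)
    println(s"partitions after repartition: ${repartitioned.getNumPartitions}")

    spark.stop()
  }
}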

Can anyone help?

Thanks,
Neethu

Recommended Answer

I am assuming you're running your jobs on YARN; if so, you can check the following properties:

yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.cpu-vcores

In YARN, these properties determine how many containers can be instantiated on a NodeManager, based on the spark.executor.cores and spark.executor.memory values (along with the executor memory overhead).

For example, if a cluster has 10 nodes (RAM: 16 GB, cores: 6) and is set with the following YARN properties:

yarn.scheduler.maximum-allocation-mb=10GB 
yarn.nodemanager.resource.memory-mb=10GB
yarn.scheduler.maximum-allocation-vcores=4
yarn.nodemanager.resource.cpu-vcores=4

Then with the Spark properties spark.executor.cores=2 and spark.executor.memory=4GB, you can expect 2 executors per node, so in total you'll get 19 executors + 1 container for the driver.

If the Spark properties are spark.executor.cores=3 and spark.executor.memory=8GB, then you will get 9 executors (only 1 executor per node) + 1 container for the driver.
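As a rough sanity check of the arithmetic above, here is a back-of-the-envelope sketch (assuming Spark's default overhead of max(384 MB, 10% of executor memory), and ignoring YARN's rounding to its minimum allocation increment):

object ExecutorEstimate {
  def main(args: Array[String]): Unit = {
    val nodes     = 10
    val nodeMemMb = 10 * 1024  // yarn.nodemanager.resource.memory-mb
    val nodeCores = 4          // yarn.nodemanager.resource.cpu-vcores

    // Executors that fit on one node, limited by memory and by vcores.
    def executorsPerNode(execMemMb: Int, execCores: Int): Int = {
      val overheadMb = math.max(384, (execMemMb * 0.10).toInt)  // Spark default
      math.min(nodeMemMb / (execMemMb + overheadMb), nodeCores / execCores)
    }

    println(executorsPerNode(4 * 1024, 2) * nodes)  // 20 containers -> 19 executors + driver
    println(executorsPerNode(8 * 1024, 3) * nodes)  // 10 containers -> 9 executors + driver
  }
}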

You can refer to the link for more details.

Hope this helps!

