Spark: Entire dataset concentrated in one executor


Problem Description

I am running a Spark job with 3 files, each 100 MB in size, and for some reason the Spark UI shows the entire dataset concentrated in 2 executors. This has kept the job running for 19 hours and counting. Below is my Spark configuration; Spark 2.3 is the version used.

spark2-submit --class org.mySparkDriver \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 8g \
    --num-executors 100 \
    --conf spark.default.parallelism=40 \
    --conf spark.yarn.executor.memoryOverhead=6000mb \
    --conf spark.dynamicAllocation.executorIdleTimeout=6000s \
    --conf spark.executor.cores=3 \
    --conf spark.executor.memory=8G \

I tried repartitioning inside the code, which works: it spreads the file across 20 partitions (I used rdd.repartition(20)). But why should I have to repartition? I believed that specifying spark.default.parallelism=40 in the script would let Spark divide the input file into 40 partitions and process it across 40 executors.
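For reference, a minimal sketch of the repartitioning workaround described above (Scala; the input path is a placeholder):

import org.apache.spark.sql.SparkSession

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RepartitionSketch").getOrCreate()
    val sc = spark.sparkContext

    // For file-based RDDs the initial partition count comes from the Hadoop
    // input splits (roughly one per HDFS block), not from
    // spark.default.parallelism, which applies to shuffles and parallelize().
    val rdd = sc.textFile("/path/to/input")  // placeholder path
    println(s"partitions after read: ${rdd.getNumPartitions}")

    // Full shuffle into 20 partitions, as described in the question.
    val repartitioned = rdd.repartition(20)
    println(s"partitions after repartition: ${repartitioned.getNumPartitions}")

    spark.stop()
  }
}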

Can anyone help?

Thanks,
Neethu

Recommended Answer

I am assuming you're running your jobs on YARN; if so, you can check the following properties:

yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.cpu-vcores

In YARN, these properties determine how many containers can be instantiated on a NodeManager, based on the spark.executor.cores and spark.executor.memory values (along with the executor memory overhead).

For example, if a cluster has 10 nodes (RAM: 16 GB, cores: 6) and is set with the following YARN properties:

yarn.scheduler.maximum-allocation-mb=10GB 
yarn.nodemanager.resource.memory-mb=10GB
yarn.scheduler.maximum-allocation-vcores=4
yarn.nodemanager.resource.cpu-vcores=4

Then with the Spark properties spark.executor.cores=2 and spark.executor.memory=4GB, you can expect 2 executors per node, so in total you'll get 19 executors + 1 container for the driver.

If the Spark properties are spark.executor.cores=3 and spark.executor.memory=8GB, then you will get 9 executors (only 1 executor per node) + 1 container for the driver.
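As a rough sanity check of the arithmetic above, here is a back-of-the-envelope sketch (assuming Spark's default overhead of max(384 MB, 10% of executor memory), and ignoring YARN's rounding to its minimum allocation increment):

object ExecutorEstimate {
  def main(args: Array[String]): Unit = {
    val nodes     = 10
    val nodeMemMb = 10 * 1024  // yarn.nodemanager.resource.memory-mb
    val nodeCores = 4          // yarn.nodemanager.resource.cpu-vcores

    // Executors that fit on one node, limited by memory and by vcores.
    def executorsPerNode(execMemMb: Int, execCores: Int): Int = {
      val overheadMb = math.max(384, (execMemMb * 0.10).toInt)  // Spark default
      math.min(nodeMemMb / (execMemMb + overheadMb), nodeCores / execCores)
    }

    println(executorsPerNode(4 * 1024, 2) * nodes)  // 20 containers -> 19 executors + driver
    println(executorsPerNode(8 * 1024, 3) * nodes)  // 10 containers -> 9 executors + driver
  }
}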

You can refer to the link for more details.

Hope this helps!

