Google Cloud Dataproc configuration issues


Question


I've been encountering various issues in some Spark LDA topic modeling (mainly disassociation errors at seemingly random intervals) I've been running, which I think mainly have to do with insufficient memory allocation on my executors. This would seem to be related to problematic automatic cluster configuration. My latest attempt uses n1-standard-8 machines (8 cores, 30GB RAM) for both the master and worker nodes (6 workers, so 48 total cores).


But when I look at /etc/spark/conf/spark-defaults.conf I see this:

spark.master yarn-client
spark.eventLog.enabled true
spark.eventLog.dir hdfs://cluster-3-m/user/spark/eventlog

# Dynamic allocation on YARN
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.initialExecutors 100000
spark.dynamicAllocation.maxExecutors 100000
spark.shuffle.service.enabled true
spark.scheduler.minRegisteredResourcesRatio 0.0

spark.yarn.historyServer.address cluster-3-m:18080
spark.history.fs.logDirectory hdfs://cluster-3-m/user/spark/eventlog

spark.executor.cores 4
spark.executor.memory 9310m
spark.yarn.executor.memoryOverhead 930

# Overkill
spark.yarn.am.memory 9310m
spark.yarn.am.memoryOverhead 930

spark.driver.memory 7556m
spark.driver.maxResultSize 3778m
spark.akka.frameSize 512

# Add ALPN for Bigtable
spark.driver.extraJavaOptions -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.3.v20150130.jar
spark.executor.extraJavaOptions -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.3.v20150130.jar


But these values don't make much sense. Why use only 4/8 executor cores? And only 9.3 / 30GB RAM? My impression was that all this config was supposed to be handled automatically, but even my attempts at manual tweaking aren't getting me anywhere.


For instance, I tried launching the shell with:

spark-shell --conf spark.executor.cores=8 --conf spark.executor.memory=24g

But this fails with:

java.lang.IllegalArgumentException: Required executor memory (24576+930 MB) is above the max threshold (22528 MB) of this cluster! Please increase the value of 'yarn.scheduler.maximum-allocation-mb'.


I tried changing the associated value in /etc/hadoop/conf/yarn-site.xml, to no effect. Even when I try a different cluster setup (e.g. using executors with 60+ GB RAM) I end up with the same problem. For some reason the max threshold remains at 22528MB.
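(One thing worth noting: hand-editing yarn-site.xml on a running cluster generally has no effect until the YARN daemons are restarted. On Dataproc the usual route is to set such properties at cluster creation time via the `--properties` flag, whose key prefix selects the config file. A sketch only; the cluster name and values here are hypothetical:)

```shell
# Hypothetical sketch: set YARN/Spark properties when creating the cluster.
# The "yarn:" / "spark:" prefixes route each key to yarn-site.xml /
# spark-defaults.conf respectively.
gcloud dataproc clusters create my-cluster \
    --master-machine-type n1-standard-8 \
    --worker-machine-type n1-standard-8 \
    --num-workers 6 \
    --properties 'yarn:yarn.scheduler.maximum-allocation-mb=24576,spark:spark.executor.memory=21g'
```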


Is there something I'm doing wrong here, or is this a problem with Google's automatic configuration?

Answer


There are some known issues with default memory configs in clusters where the master machine type is different from the worker machine type, though in your case that doesn't appear to be the main issue.

When you see the following:

spark.executor.cores 4
spark.executor.memory 9310m


this actually means that each worker node will run 2 executors, and each executor will utilize 4 cores such that all 8 cores are indeed used up on each worker. This way, if we give the AppMaster half of one machine, the AppMaster can successfully be packed next to an executor.


The amount of memory given to NodeManagers needs to leave some overhead for the NodeManager daemon itself, and misc. other daemon services such as the DataNode, so ~80% is left for NodeManagers. Additionally, allocations must be a multiple of the minimum YARN allocation, so after flooring to the nearest allocation multiple, that's where the 22528MB comes from for n1-standard-8.
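To see where these numbers bite, here is a quick arithmetic check (values in MB; the 22528 ceiling is taken from the error message above, not derived independently):

```shell
# Worked check of the container sizes against the 22528 MB ceiling quoted
# in the IllegalArgumentException above. All values in MB.
node_capacity=22528                 # max YARN allocation on this n1-standard-8

default_request=$((9310 + 930))     # spark.executor.memory + memoryOverhead
echo "default executor container: ${default_request} MB"            # 10240
echo "two fit per node? $(( 2 * default_request <= node_capacity ))"  # 1 (yes)

big_request=$((24576 + 930))        # the attempted 24g executor
echo "24g executor container: ${big_request} MB"                    # 25506
echo "fits? $(( big_request <= node_capacity ))"   # 0 (no) -> rejected by YARN
```

So the default 9310m + 930m request is exactly 10240 MB, and two of those (20480 MB) pack under the 22528 MB node capacity, while the 24g attempt (25506 MB) exceeds it outright.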


If you add workers that have 60+ GB of RAM, then as long as you use a master node of the same memory size, you should see a higher max threshold number.


Either way, if you're seeing OOM issues, then it's not so much the memory per-executor that matters the most, but rather the memory per-task. And if you are increasing spark.executor.cores at the same time as spark.executor.memory, then the memory per-task isn't actually being increased, so you won't really be giving more headroom to your application logic in that case; Spark will use spark.executor.cores to determine the number of concurrent tasks to run in the same memory space.
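The per-task point can be sketched numerically (a rough model only: this ignores JVM overheads and Spark's internal memory fractions, and just divides executor memory by concurrent task slots):

```shell
# Rough per-task memory: executor heap shared by spark.executor.cores tasks.
# Integer MB arithmetic, illustrative only.
per_task_default=$((9310 / 4))               # default config: ~2327 MB/task
per_task_scaled=$(( (2 * 9310) / (2 * 4) ))  # doubling both: still ~2327 MB/task
per_task_fewer_cores=$((9310 / 2))           # halving cores: ~4655 MB/task
echo "${per_task_default} ${per_task_scaled} ${per_task_fewer_cores}"
```

Scaling memory and cores together leaves the per-task share unchanged; only changing their ratio moves it.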


To actually get more memory per task, you should mostly try:

  1. Use n1-highmem-* machine types
  2. Try reducing spark.executor.cores while leaving spark.executor.memory the same
  3. Try increasing spark.executor.memory while leaving spark.executor.cores the same
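Concretely, options (2) and (3) are just flag changes at launch time (a sketch; the exact values are illustrative, and a larger executor must still fit under the YARN max-allocation ceiling once overhead is added):

```shell
# (2) fewer cores per executor, same memory -> more memory per task
spark-shell --conf spark.executor.cores=2 --conf spark.executor.memory=9310m

# (3) same cores, more memory per executor; e.g. 18g + overhead still
#     fits under the 22528 MB ceiling seen above
spark-shell --conf spark.executor.cores=4 --conf spark.executor.memory=18g
```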


If you do (2) or (3) above then you'll indeed be leaving cores idle compared to the default config which tries to occupy all cores, but that's really the only way to get more memory per-task aside from going to highmem instances.
