Google Cloud Dataproc configuration issues


Problem description

I've been encountering various issues in some Spark LDA topic modeling (mainly disassociation errors at seemingly random intervals) I've been running, which I think mainly have to do with insufficient memory allocation on my executors. This would seem to be related to problematic automatic cluster configuration. My latest attempt uses n1-standard-8 machines (8 cores, 30GB RAM) for both the master and worker nodes (6 workers, so 48 total cores).
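
For reference, a cluster of roughly that shape would be created with something like the command below (an illustrative sketch only; the cluster name is guessed from the cluster-3-m hostname in the config further down, and the exact flags/zone used may differ):

gcloud dataproc clusters create cluster-3 \
    --master-machine-type n1-standard-8 \
    --worker-machine-type n1-standard-8 \
    --num-workers 6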

But when I look at /etc/spark/conf/spark-defaults.conf I see this:

spark.master yarn-client
spark.eventLog.enabled true
spark.eventLog.dir hdfs://cluster-3-m/user/spark/eventlog

# Dynamic allocation on YARN
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.initialExecutors 100000
spark.dynamicAllocation.maxExecutors 100000
spark.shuffle.service.enabled true
spark.scheduler.minRegisteredResourcesRatio 0.0

spark.yarn.historyServer.address cluster-3-m:18080
spark.history.fs.logDirectory hdfs://cluster-3-m/user/spark/eventlog

spark.executor.cores 4
spark.executor.memory 9310m
spark.yarn.executor.memoryOverhead 930

# Overkill
spark.yarn.am.memory 9310m
spark.yarn.am.memoryOverhead 930

spark.driver.memory 7556m
spark.driver.maxResultSize 3778m
spark.akka.frameSize 512

# Add ALPN for Bigtable
spark.driver.extraJavaOptions -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.3.v20150130.jar
spark.executor.extraJavaOptions -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.3.v20150130.jar

But these values don't make much sense. Why use only 4/8 executor cores? And only 9.3 / 30GB RAM? My impression was that all this config was supposed to be handled automatically, but even my attempts at manual tweaking aren't getting me anywhere.

For instance, I tried launching the shell with:

spark-shell --conf spark.executor.cores=8 --conf spark.executor.memory=24g

But then this failed with

java.lang.IllegalArgumentException: Required executor memory (24576+930 MB) is above the max threshold (22528 MB) of this cluster! Please increase the value of 'yarn.scheduler.maximum-allocation-mb'.

I tried changing the associated value in /etc/hadoop/conf/yarn-site.xml, to no effect. Even when I try a different cluster setup (e.g. using executors with 60+ GB RAM) I end up with the same problem. For some reason the max threshold remains at 22528MB.
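
The threshold in the error corresponds to the yarn.scheduler.maximum-allocation-mb property; in /etc/hadoop/conf/yarn-site.xml the entry looks roughly like this (the value shown simply mirrors the error message above, and the ResourceManager generally has to be restarted before an edit here takes effect):

<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>22528</value>
</property>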

Is there something I'm doing wrong here, or is this a problem with Google's automatic configuration?

Recommended answer

There are some known issues with default memory configs in clusters where the master machine type is different from the worker machine type, though in your case that doesn't appear to be the main issue.

When you see the following:

spark.executor.cores 4
spark.executor.memory 9310m

this actually means that each worker node will run 2 executors, and each executor will utilize 4 cores such that all 8 cores are indeed used up on each worker. This way, if we give the AppMaster half of one machine, the AppMaster can successfully be packed next to an executor.
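
A quick sanity check with the numbers from the generated spark-defaults.conf above shows how this packing works out per n1-standard-8 worker:

2 executors x (9310m + 930m overhead) = 20480 MB  (fits under the 22528 MB YARN limit per node)
2 executors x 4 cores                 = 8 cores   (all cores in use)
AppMaster:   9310m + 930m overhead    = 10240 MB  (roughly half of one node)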

The amount of memory given to NodeManagers needs to leave some overhead for the NodeManager daemon itself and miscellaneous other daemon services such as the DataNode, so roughly 80% is left for NodeManagers. Additionally, allocations must be a multiple of the minimum YARN allocation, so after rounding down to the nearest allocation multiple, that's where the 22528MB comes from for n1-standard-8.
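
If you want to see the limits YARN is actually working with on your own cluster, a quick way (assuming the standard config path quoted in the question) is to grep them out of yarn-site.xml:

grep -A 1 'yarn.nodemanager.resource.memory-mb' /etc/hadoop/conf/yarn-site.xml
grep -A 1 'yarn.scheduler.maximum-allocation-mb' /etc/hadoop/conf/yarn-site.xml
grep -A 1 'yarn.scheduler.minimum-allocation-mb' /etc/hadoop/conf/yarn-site.xml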

If you add workers that have 60+ GB of RAM, then as long as you use a master node of the same memory size you should see a higher max threshold number.

Either way, if you're seeing OOM issues, then it's not so much the memory per-executor that matters the most, but rather the memory per-task. And if you are increasing spark.executor.cores at the same time as spark.executor.memory, then the memory per-task isn't actually being increased, so you won't really be giving more headroom to your application logic in that case; Spark will use spark.executor.cores to determine the number of concurrent tasks to run in the same memory space.
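
As a rough back-of-envelope with the numbers above, per-task memory is approximately spark.executor.memory divided by spark.executor.cores:

default:   9310m / 4 cores = ~2.3 GB per task
attempted: 24g / 8 cores   = 3 GB per task   (only marginally more, and rejected by YARN anyway)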

To actually get more memory per task, you mainly want to try:

  1. Use n1-highmem-* machine types
  2. Try reducing spark.executor.cores while leaving spark.executor.memory the same
  3. Try increasing spark.executor.memory while leaving spark.executor.cores the same

If you do (2) or (3) above then you'll indeed be leaving cores idle compared to the default config which tries to occupy all cores, but that's really the only way to get more memory per-task aside from going to highmem instances.
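
Concretely, with this cluster's defaults, options (2) and (3) might look like the following (illustrative values only; for (3), executor memory plus the 930m overhead has to stay under the 22528 MB cap):

# (2) fewer cores per executor, same memory -> roughly 2x the memory per task
spark-shell --conf spark.executor.cores=2 --conf spark.executor.memory=9310m

# (3) same cores, more memory -> more memory per task (18g + 930m overhead < 22528 MB)
spark-shell --conf spark.executor.cores=4 --conf spark.executor.memory=18g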

