Airflow Memory Error: Task exited with return code -9

Problem description

According to both Link1 and Link2, my Airflow DAG run is returning the error INFO - Task exited with return code -9 due to an out-of-memory issue. My DAG run has 10 tasks/operators, and each task simply:


  1. makes a query to get one of my BigQuery tables, and then
  2. writes the results to a collection in my Mongo database.

The sizes of the 10 BigQuery tables range from 1MB to 400MB, and the total size of all 10 tables is ~1GB. My Docker container has 2GB of memory by default, and I've increased this to 4GB; however, I am still receiving this error from a few of the tasks. I am confused about this, as 4GB should be plenty of memory for this workload. I am also concerned because, in the future, these tables may become larger (a single table query could be 1-2GB), and I'd like to avoid these return code -9 errors at that time.

Any thoughts or advice on this would be greatly appreciated. I'm not quite sure how to handle this issue, since the point of the DAG is to transfer data from BigQuery to Mongo daily, so the query results held in memory by the DAG's tasks are necessarily fairly large, given the size of the tables.
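For illustration only, here is a minimal sketch of how a task like the two steps above could move rows in fixed-size batches, so that peak memory depends on the batch size rather than on the table size. The names `batched`, `transfer`, and `write_batch` are hypothetical and not part of the DAG described here; the sketch is deliberately library-agnostic so it runs without BigQuery or Mongo.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def batched(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield lists of at most batch_size rows without materializing the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def transfer(
    rows: Iterable[Row],
    write_batch: Callable[[List[Row]], Any],
    batch_size: int = 1000,
) -> int:
    """Pass rows to write_batch one batch at a time; return the row count."""
    total = 0
    for batch in batched(rows, batch_size):
        write_batch(batch)  # only this batch is held in memory here
        total += len(batch)
    return total


if __name__ == "__main__":
    # Stand-ins for the real source and sink (assumptions for the demo).
    fake_rows = ({"id": i} for i in range(2500))
    written: List[List[Row]] = []
    print(transfer(fake_rows, written.append, batch_size=1000))
```

In a real task, `rows` might be a lazy generator over a BigQuery result such as `(dict(r.items()) for r in client.query(sql).result())`, and `write_batch` might be a pymongo collection's `insert_many`. Both are assumptions about how the task is written, since the original code is not shown.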

Recommended answer

As you said, the error message you get corresponds to an out-of-memory issue.
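As background on the number itself: on POSIX systems, Python reports a child process that was terminated by signal N with return code -N, and signal 9 is SIGKILL, which is what the Linux out-of-memory killer sends. The small sketch below decodes a return code under the assumption that Airflow reports it using that convention; `describe_return_code` is a hypothetical helper, not an Airflow function.

```python
import signal


def describe_return_code(return_code: int) -> str:
    """Explain a POSIX-style return code (negative means killed by a signal)."""
    if return_code < 0:
        # -N means the process was terminated by signal N.
        return f"killed by signal {signal.Signals(-return_code).name}"
    return f"exited with status {return_code}"


if __name__ == "__main__":
    print(describe_return_code(-9))
```

Note that SIGKILL cannot be caught by the task, so the task gets no chance to clean up or log anything before it disappears.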

Please see the official documentation:


DAG execution is RAM limited. Each task execution starts with two Airflow processes: task execution and monitoring. Currently, each node can take up to 6 concurrent tasks. More memory can be consumed, depending on the size of the DAG.

High memory pressure in any of the GKE nodes will lead the Kubernetes scheduler to evict pods from nodes in an attempt to relieve that pressure. While many different Airflow components are running within GKE, most don't tend to use much memory, so the case that happens most frequently is that a user uploaded a resource-intensive DAG. The Airflow workers run those DAGs, run out of resources, and then get evicted.

You can check this with the following steps:


  1. In the Cloud Console, navigate to Kubernetes Engine -> Workloads.
  2. Click airflow-worker, and look at the pods listed under Managed pods.
  3. If there are pods that show Evicted, click each evicted pod and look for the "The node was low on resource: memory" message at the top of the window.
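The same check can be roughly scripted. The sketch below assumes you have saved the output of `kubectl get pods --all-namespaces -o json` and parsed it into a dict; it relies on evicted pods carrying `status.reason == "Evicted"` along with a `status.message`. The helper `find_evicted_pods` is a hypothetical name used for illustration.

```python
from typing import Any, Dict, List, Tuple


def find_evicted_pods(pod_list: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (pod name, eviction message) for every evicted pod in the list."""
    evicted = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("reason") == "Evicted":
            name = pod.get("metadata", {}).get("name", "<unknown>")
            evicted.append((name, status.get("message", "")))
    return evicted


if __name__ == "__main__":
    # Hand-written sample shaped like kubectl's JSON output (not real data).
    sample = {
        "items": [
            {"metadata": {"name": "airflow-worker-a"},
             "status": {"phase": "Running"}},
            {"metadata": {"name": "airflow-worker-b"},
             "status": {"phase": "Failed", "reason": "Evicted",
                        "message": "The node was low on resource: memory."}},
        ]
    }
    for name, message in find_evicted_pods(sample):
        print(name, "->", message)
```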

What are the possible ways to fix the OOM issue?


  • Create a new Cloud Composer environment with a larger machine type than the current machine type.
  • Ensure that the tasks in the DAG are idempotent, which means that the result of running the same DAG run multiple times should be the same as the result of running it once.
  • Configure task retries by setting the number of retries on the task. That way, when a task is killed with return code -9, it goes to up_for_retry instead of failed.
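As a rough sketch of the last two points: `retries` and `retry_delay` are standard Airflow operator arguments, usually passed through a DAG's `default_args`, while the values chosen here (3 retries, 5 minutes) are arbitrary assumptions. The `upsert_all` helper is hypothetical and uses a plain dict as a stand-in for a Mongo collection to show why keyed writes make a retried task safe.

```python
from datetime import timedelta
from typing import Any, Dict, Iterable

# Would be passed as DAG(..., default_args=default_args) in the DAG file.
default_args = {
    "retries": 3,                         # assumed value, tune as needed
    "retry_delay": timedelta(minutes=5),  # assumed value, tune as needed
}


def upsert_all(store: Dict[Any, dict], docs: Iterable[dict], key: str = "_id") -> None:
    """Write each document under its key, overwriting any earlier copy.

    Running this twice with the same docs leaves the store unchanged,
    which is the idempotency property a retried task needs.
    """
    for doc in docs:
        store[doc[key]] = doc


if __name__ == "__main__":
    store: Dict[Any, dict] = {}
    docs = [{"_id": 1, "v": "a"}, {"_id": 2, "v": "b"}]
    upsert_all(store, docs)
    upsert_all(store, docs)  # simulated retry: no duplicates are created
    print(len(store))
```

With pymongo, the equivalent keyed write would be something like `collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)`. By contrast, a plain `insert_many` that is repeated after a partial run would either fail on duplicate keys or create duplicate documents, depending on how `_id` is assigned.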

Additionally, you can check the CPU behavior:


  1. In the Cloud Console, navigate to Kubernetes Engine -> Clusters
  2. Locate Node Pools at the bottom of the page, and expand the default-pool section
  3. Click the link listed under Instance groups
  4. Switch to the Monitoring tab, where you can find CPU utilization

Ideally, the GCE instances should not be running above 70% CPU all the time; otherwise the Composer environment may become unstable during periods of heavy resource usage.

I hope you find the above pieces of information useful.
