Amazon MWAA Airflow - Tasks container shut down / stop / killed without logs


Problem Description

We use Amazon MWAA Airflow. Occasionally a task is marked as "FAILED", but there are no logs at all, as if the container had been shut down without notifying us.

I found this link: https://cloud.google.com/composer/docs/how-to/using/troubleshooting-dags#task_fails_without_emitting_logs which explains this as an OOM on the machine. But our tasks use almost no CPU or RAM; they only make a single HTTP call to an AWS API, so they are very light.

On CloudWatch, I can see that no other tasks are launched on the same container (each DAG run starts by printing the container IP, so I can search for this IP across all tasks).
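For reference, a minimal sketch of how such an IP-logging task could look. The original post does not show its code, so the DAG id, task name, and the use of the socket module here are assumptions:

import socket
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def log_container_ip():
    # Resolve and print the IP of the worker container running this task,
    # so it can later be searched across task logs in CloudWatch.
    ip = socket.gethostbyname(socket.gethostname())
    print(f"Running on worker container IP: {ip}")


with DAG(
    dag_id="example_log_container_ip",  # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="print_container_ip",
        python_callable=log_container_ip,
    )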

If someone has an idea, that would be great, thanks!

Recommended Answer

MWAA uses ECS as a backend, and the way things work is that ECS autoscales the number of workers according to the number of tasks running in the cluster. For a small environment, each worker can handle 5 tasks by default. If there are more than 5 tasks, it scales out another worker, and so on.

We don't do any compute on Airflow (batch, long-running jobs); our DAGs are mainly API requests to other services, which means our DAGs run fast and are short-lived. From time to time, we can spike to eight or more tasks for a very short period (a few seconds). In that case, autoscaling triggers a scale-out and adds one or more workers to the cluster. Then, since those tasks are only API requests, they execute very quickly and the number of running tasks immediately drops back to 0, which triggers a scale-in (workers are removed). If at that exact moment another task is scheduled, Airflow can end up running it on a container that is being removed, and the task is killed in the middle without any notice (a race condition). You usually see incomplete logs when this happens.

The first workaround is to disable autoscaling by freezing the number of workers in the cluster. You can set the min and max to the appropriate number of workers, which will depend on your workload. Admittedly, we lose the elasticity of the service:

$ aws mwaa update-environment --name MyEnvironmentName --min-workers 2 --max-workers 2
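If you prefer to apply the same freeze from Python rather than the CLI, a sketch using the boto3 MWAA client (the environment name is the same placeholder as above):

import boto3

mwaa = boto3.client("mwaa")

# Pin the floor and the ceiling to the same value so the worker count never changes.
mwaa.update_environment(
    Name="MyEnvironmentName",
    MinWorkers=2,
    MaxWorkers=2,
)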

Another workaround suggested by AWS is to always have one dummy task running (an infinite loop) so you never end up scaling in all your workers.
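A rough sketch of what such a keep-alive DAG could look like. The DAG id, interval, and sleep duration are assumptions; the only goal is to keep the running-task count above zero, and in practice the sleep would need to be tuned so consecutive runs overlap:

import time
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def keep_alive(duration_seconds: int = 3700):
    # Sleep for slightly longer than the schedule interval so a task
    # instance is (almost) always occupying a worker slot.
    end = time.time() + duration_seconds
    while time.time() < end:
        time.sleep(60)


with DAG(
    dag_id="keep_worker_alive",            # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval=timedelta(hours=1),
    max_active_runs=2,                     # let the next run overlap the previous one
    catchup=False,
) as dag:
    PythonOperator(
        task_id="keep_alive",
        python_callable=keep_alive,
    )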

AWS told us they are working on a solution to improve the executor.
