Airflow DAG is running for all the retries


Problem description

I have a DAG that has been running for a few months, and since last week it has been behaving abnormally. It runs a BashOperator that executes a shell script, and the shell script contains a Hive query. The number of retries is set to 4, as below.

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 4,
    'retry_delay': timedelta(minutes=5)
}
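For context, a minimal sketch of how such a DAG might be wired up is shown below. Only default_args comes from the question; the dag_id, schedule, start date, and script path are hypothetical placeholders, not the asker's actual configuration.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10.x import path

# default_args as shown above (retries=4, 5-minute retry delay)
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 4,
    'retry_delay': timedelta(minutes=5)
}

with DAG(
    dag_id='hive_report',                 # hypothetical name
    default_args=default_args,
    start_date=datetime(2020, 1, 1),      # hypothetical start date
    schedule_interval='@daily',           # hypothetical schedule
    catchup=False,
) as dag:
    # The shell script is assumed to wrap the Hive query, e.g. `hive -f query.hql`.
    run_hive_query = BashOperator(
        task_id='run_hive_query',
        bash_command='/path/to/run_hive_query.sh ',  # trailing space so the .sh is not treated as a Jinja template file
    )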

I can see in the log that the task triggers the Hive query, loses its heartbeats after some time (around 5 to 6 minutes), and goes for a retry. YARN shows that the query has not finished yet, but Airflow has already triggered the next run. So now two queries are running in YARN for the same task (one for the first run and a second for the retry). Similarly, this DAG ends up triggering 5 queries for the same task (as retries is 4) and finally shows the task as failed. The interesting point is that the same DAG had been running fine for a long time. Also, this issue affects all the Hive-related DAGs in production. Today I upgraded to the latest version of Airflow, 1.10.9. I am using the LocalExecutor in this case.

Has anyone faced a similar issue?

Recommended answer

The Airflow UI doesn't initiate retries on its own, irrespective of whether it's connected to the backend DB or not. It looks like your task executors are turning into zombies; in that case the Scheduler's zombie detection kicks in and calls the task instance's (TI's) handle_failure method. So, in a nutshell, you can override that method in your DAG and add some logging to see what's happening. In fact, you should be able to call the Hadoop ResourceManager, check the status of your job, and make a decision accordingly, including cancelling the retries.

For example, see this code, which I wrote to handle zombie failures only.
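That code is not reproduced on this page, but a minimal sketch of the approach described above might look like the following. It monkey-patches TaskInstance.handle_failure from a DAG file; the ResourceManager URL, the way the YARN application id is obtained, and the retry-cancelling step are all assumptions for illustration, not the answerer's original code.

import logging

import requests  # assumed to be available in the Airflow environment
from airflow.models import TaskInstance

YARN_RM_URL = 'http://yarn-rm.example.com:8088'  # hypothetical ResourceManager endpoint

_original_handle_failure = TaskInstance.handle_failure

def patched_handle_failure(self, *args, **kwargs):
    """Add logging and a YARN status check before Airflow's normal failure handling."""
    logging.warning(
        'handle_failure called for %s.%s (try %s); checking YARN before allowing a retry',
        self.dag_id, self.task_id, self.try_number,
    )
    # Hypothetical lookup: assumes the YARN application id was stashed on the
    # task instance (e.g. via XCom or a custom operator) as `yarn_application_id`.
    app_id = getattr(self, 'yarn_application_id', None)
    if app_id:
        resp = requests.get('{}/ws/v1/cluster/apps/{}'.format(YARN_RM_URL, app_id))
        state = resp.json().get('app', {}).get('state')
        logging.warning('YARN application %s is in state %s', app_id, state)
        if state == 'RUNNING':
            # The query is still running in YARN, so avoid launching a duplicate:
            # lowering max_tries below try_number makes this TI ineligible for retry.
            self.max_tries = self.try_number - 1
    return _original_handle_failure(self, *args, **kwargs)

TaskInstance.handle_failure = patched_handle_failure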
