Why does Airflow change start_date without renaming the DAG?
Question
I am a data engineer and work with Airflow regularly.
When redeploying DAGs with a new start date, the best practice is described as follows:
Don’t change start_date + interval: When a DAG has been run, the scheduler database contains instances of the run of that DAG. If you change the start_date or the interval and redeploy it, the scheduler may get confused because the intervals are different or the start_date is way back. The best way to deal with this is to change the version of the DAG as soon as you change the start_date or interval, i.e. my_dag_v1 and my_dag_v2. This way, historical information is also kept about the old version.
However, after deleting all previous DAG and task runs, I tried to redeploy a DAG with a new start date. It worked as expected (with the new start date) for a day, then started using the old start date again.
What is the reason for this?
Recommended answer
Airflow maintains all of the information about past runs in the dag_run table.
When you clear the previous DAG runs, these entries are dropped from the database. Airflow therefore treats the DAG as a new DAG and starts it at the specified time.
Airflow checks the last DAG execution time (start_date of the last run) and adds the timedelta object that you specified in schedule_interval.
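The calculation described above can be sketched in plain Python. This is a simplified model of the behaviour the answer describes, not Airflow's actual scheduler code; the function name next_run_time and its parameters are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def next_run_time(
    last_run: Optional[datetime],
    dag_start_date: datetime,
    schedule_interval: timedelta,
) -> datetime:
    """Simplified model: last recorded run plus the interval.

    If no run is recorded (for example, after the dag_run entries
    were cleared), fall back to the DAG's configured start date.
    """
    if last_run is None:
        return dag_start_date
    return last_run + schedule_interval


interval = timedelta(days=1)
new_start = datetime(2020, 3, 1, 6, 0)

# No recorded runs: the DAG behaves like a new DAG.
first = next_run_time(None, new_start, interval)

# A recorded run exists: the next run is derived from it,
# not from the start date in the DAG file.
old_run = datetime(2020, 1, 1, 6, 0)
following = next_run_time(old_run, new_start, interval)

print(first)      # 2020-03-01 06:00:00
print(following)  # 2020-01-02 06:00:00
```

In this model, any run that is still recorded for the DAG takes precedence over a newly deployed start date, which matches the symptom described in the question.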
If you still have difficulties after clearing the DAG runs, there are a few things you can do:
- Rename the DAG, as suggested.
- Clear all the DAG runs while keeping the DAG paused. Create a DAG run and then turn the DAG on. It will run at the scheduled time afterwards.
- The best approach would be to use a crontab expression in schedule_interval.
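As a rough illustration of the first and last suggestions, the sketch below builds the arguments for a renamed DAG that uses a crontab expression. The helper versioned_dag_id, the name my_dag, the start date, and the cron string "0 6 * * *" (daily at 06:00) are all assumptions made for the example. The Airflow call itself is left as a comment so the snippet runs without Airflow installed, and it assumes an Airflow release that accepts the schedule_interval argument used in this answer.

```python
from datetime import datetime


def versioned_dag_id(base_name: str, version: int) -> str:
    """Hypothetical helper: build ids such as my_dag_v1, my_dag_v2."""
    return f"{base_name}_v{version}"


# Bump the version whenever start_date or the schedule changes,
# so the old run history stays attached to the old DAG id.
dag_kwargs = {
    "dag_id": versioned_dag_id("my_dag", 2),
    "start_date": datetime(2020, 3, 1),
    # Crontab expression instead of a timedelta: run daily at 06:00.
    "schedule_interval": "0 6 * * *",
}

# With Airflow installed, the DAG could then be declared as:
#   from airflow import DAG
#   dag = DAG(**dag_kwargs)

print(dag_kwargs["dag_id"])  # my_dag_v2
```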