Run DAG at specific time each day

Problem description

I've read multiple examples about schedule_interval and start_date, and the Airflow docs multiple times as well, and I still can't wrap my head around this:

How do I get my DAG to execute at a specific time each day? E.g., say it's now 9:30 (AM), I deploy my DAG, and I want it to be executed at 10:30.

I have tried


with DAG(
    "test",
    default_args=default_args,
    description="test",
    schedule_interval="0 10 * * *",
    start_date=days_ago(0),
    tags=["goodie"],
) as dag:

but for some reason it wasn't run today. I have tried different start_dates, including start_date = datetime.datetime(2021,6,23), but it does not get executed.

If I replace days_ago(0) with days_ago(1) it is behind 1 day all the time, i.e. it does not get run today but did run yesterday.

Isn't there an easy way to say "I deploy my DAG now, and I want to get it executed with this cron-syntax" (which I assume is what most people want) instead of calculating an execution time, based on start_date, schedule_interval and figuring out, how to interpret it?

Solution

"If I replace days_ago(0) with days_ago(1) it is behind 1 day all the time"

It's not behind. You are simply confusing the Airflow scheduling mechanism with cron jobs. In a cron job you just provide a cron expression and the job is scheduled accordingly. This is not how it works in Airflow.

In Airflow the scheduling is calculated from start_date + schedule interval. Airflow executes the job at the END of the interval. This is consistent with how data pipelines usually work: a run that processes one day's data can only start once that day is over, so today you are processing yesterday's data.

As a rule: NEVER use a dynamic start date.
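To see why a dynamic start date is a problem, here is a rough sketch in plain Python. It does not use Airflow and is not the scheduler's actual algorithm; it only models the rule above (a run fires when its interval ends) for a daily "0 10 * * *" schedule. The helpers first_trigger and midnight are made up for this illustration, and midnight assumes that days_ago(0) resolves to midnight of the current day.

```python
from datetime import datetime, timedelta


def first_trigger(start_date, hour=10):
    """Earliest time the first run of a daily `0 <hour> * * *` DAG can fire.

    Simplified model: the first interval begins at the first cron tick at or
    after start_date, and the run fires one day later, when that interval ends.
    """
    tick = start_date.replace(hour=hour, minute=0, second=0, microsecond=0)
    if tick < start_date:
        tick += timedelta(days=1)
    return tick + timedelta(days=1)


def midnight(now):
    # Hypothetical stand-in for days_ago(0): midnight of the current day.
    return now.replace(hour=0, minute=0, second=0, microsecond=0)


# Dynamic start date: it is re-evaluated every time the DAG file is parsed,
# so the first run is always due "tomorrow", whatever day you look at it.
for now in (datetime(2021, 6, 23, 9, 30), datetime(2021, 6, 24, 9, 30)):
    print(now, "-> first run due", first_trigger(midnight(now)))

# Fixed start date: the first run is due at one fixed point in time.
fixed = first_trigger(datetime(2021, 6, 23, 10, 0))
print("fixed start_date -> first run due", fixed)
```

In this model the dynamic start date keeps pushing the first run one day ahead, while the fixed start date gives a target that does not move.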

Setting:

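from datetime import datetime

from airflow import DAG

# default_args is assumed to be defined earlier in the DAG file.
with DAG(
    "test",
    default_args=default_args,
    description="test",
    schedule_interval="0 10 * * *",
    start_date=datetime(2021, 6, 23, 10, 0),  # 2021-06-23 10:00
    tags=["goodie"],
) as dag:
    ...  # tasks go here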

This means that the first run will start on 2021-06-24 10:00, and its execution_date will be 2021-06-23 10:00. The second run will start on 2021-06-25 10:00, and its execution_date will be 2021-06-24 10:00.
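The dates in this example can be reproduced with a few lines of plain Python. This is only a sketch of the arithmetic, not Airflow code: the function daily_runs is a name invented here, and it assumes a daily interval and a start_date that falls exactly on a schedule tick, as in the DAG above.

```python
from datetime import datetime, timedelta


def daily_runs(start_date, count):
    """Return (execution_date, triggered_at) pairs for a daily schedule.

    Each run is labelled with the start of its interval (execution_date)
    and is triggered at the end of that interval, one day later.
    """
    interval = timedelta(days=1)
    runs = []
    for i in range(count):
        execution_date = start_date + i * interval
        runs.append((execution_date, execution_date + interval))
    return runs


runs = daily_runs(datetime(2021, 6, 23, 10, 0), count=2)
for execution_date, triggered_at in runs:
    print(f"execution_date={execution_date}  triggered at {triggered_at}")
```

Each run's execution_date is exactly one interval earlier than the moment the run starts, which is why the DAG looks like it is "a day behind".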

Since this is a source of confusion for many new users, there is an architecture change in progress, AIP-39 Richer scheduler_interval, which will decouple WHEN to run from WHAT interval to consider for that run. At the time of writing this is not yet finalized.
