Use Airflow for frequent tasks


Problem description

We have been using Airflow for a while, and it is just great.

Now we are considering moving some of our very frequent tasks onto our Airflow server too.

Let's say I have a script that runs every second.

What's the best practice to schedule it with Airflow:

  1. Run this script in a DAG that is scheduled every second. I highly doubt this will be the solution; there is significant overhead for a DAG run.

  2. Run this script in a while loop that stops after 6 hours, then schedule it on Airflow to run every 6 hours.

  3. Create a DAG with no schedule, and put the task in a while True loop with a proper sleep time, so the task never terminates unless there is an error (see the sketch after this list).
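
As a rough illustration of option 3 (not part of the original post), a single-task DAG along these lines would work, assuming Airflow 2.4+ and the PythonOperator; the DAG id, the do_one_iteration() body, and the one-second sleep are placeholders:

    import time
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def do_one_iteration():
        # Placeholder for the real per-second script.
        print("tick")


    def run_forever():
        # Option 3: the task never returns unless an exception is raised.
        while True:
            do_one_iteration()
            time.sleep(1)  # pace the loop at roughly one iteration per second


    with DAG(
        dag_id="frequent_task_forever",  # placeholder name
        start_date=datetime(2024, 1, 1),
        schedule=None,                   # no schedule; trigger the single run manually
        catchup=False,
    ) as dag:
        PythonOperator(task_id="loop_forever", python_callable=run_forever)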

Any other suggestions?

Or is this kind of task just not suitable for Airflow? Should we do it with a Lambda function and an AWS scheduler instead?

Cheers!

Recommended answer

"What's the best practice to schedule it ... this kind of task is just not suitable for Airflow?"

It is not suitable.

In particular, your Airflow is probably configured to re-examine the set of DAGs every 5 seconds, which doesn't sound like a good fit for a 1-second task. Plus, the ratio of scheduling overhead to work performed would not be attractive. I suppose you could schedule five simultaneous tasks, twelve times per minute, and have them sleep zero to four seconds, but that's just crazy. And you would likely need to "lock against yourself" to avoid having simultaneous sibling tasks step on each other's toes.

The six-hour suggestion (2.) is not crazy. I would view it as a sixty-minute @hourly task instead, since the overheads are similar. Exiting after an hour and letting Airflow respawn the task has several benefits. Log rolling happens at regular intervals. If your program crashes, it will be restarted before too long. If your host reboots, again your program is restarted before too long. The downside is that your business needs may view "more than a minute" as "much too long". And coordinating overlapping tasks, or the gap between tasks, at the hour boundary may pose some issues.
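
As a sketch of that hourly-respawn pattern (assuming Airflow 2.4+; the 55-minute budget, DAG id, and task body are illustrative placeholders), the task can loop until a deadline and then exit cleanly so the next scheduled run takes over:

    import time
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def loop_until_deadline():
        # Run for just under an hour, then exit so the next DAG run takes over.
        deadline = time.monotonic() + 55 * 60  # 55-minute budget leaves headroom
        while time.monotonic() < deadline:
            print("tick")                       # placeholder for the real script
            time.sleep(1)


    with DAG(
        dag_id="frequent_task_hourly",          # placeholder name
        start_date=datetime(2024, 1, 1),
        schedule="@hourly",
        catchup=False,
        max_active_runs=1,  # avoid overlapping loops at the hour boundary
    ) as dag:
        PythonOperator(task_id="hourly_loop", python_callable=loop_until_deadline)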

Your stated needs exactly match the problem that Supervisor addresses. Just use that. You will always have exactly one copy of your event loop running, even if the app crashes, even if the host crashes. Log rolling and other administrative details have already been addressed. The code base is mature, and lots of folks have beaten on it and incorporated their feature requests. It fits what you want.
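
For illustration, a minimal supervisord program entry for such an event loop might look like the following; the program name, script path, and log locations are placeholders rather than anything from the original question:

    ; Illustrative supervisord entry (names and paths are placeholders).
    [program:frequent_task]
    ; the script contains its own while/sleep loop
    command=/usr/bin/python3 /opt/jobs/frequent_task.py
    autostart=true
    ; respawn the loop if it ever exits or crashes
    autorestart=true
    ; treat the process as successfully started only after 5 seconds up
    startsecs=5
    stdout_logfile=/var/log/frequent_task.out.log
    stderr_logfile=/var/log/frequent_task.err.log

With autorestart=true, supervisord itself keeps a single long-running copy of the loop and respawns it after crashes, which is the property described above.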

