Granularity of tasks in Airflow
Problem description
For one task, there are many helper tasks - fetch/save properties from file/db, validations, audits. These helper methods are not time-consuming.
One sample DAG flow:
fetch_data >> actual_processing >> validation >> save_data >> audit
What's the recommendation in this scenario:
- create a separate task for each helper, or
- club everything into a single task?
What's the overhead of an Airflow task assuming there are enough resources?
Recommended answer
Question-1
What's the recommendation in this scenario?
Always try to keep maximum stuff in a single task (and preferably have fat tasks that run for several minutes rather than lean tasks that run for a few seconds) in order to (not an exhaustive list):

1. minimize scheduling latency
2. minimize load on the scheduler / webserver / SQLAlchemy backend db
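As a sketch of the fat-task approach, assuming the helper steps are plain Python callables, they can be chained inside a single function that backs one operator, so the scheduler tracks one task instance instead of five. The helper names below mirror the sample flow from the question but their bodies are hypothetical placeholders:

```python
# Hypothetical helper steps; in a real DAG these would hit files/DBs.
def fetch_data():
    return {"value": 21}

def actual_processing(data):
    return {"value": data["value"] * 2}

def validation(result):
    assert result["value"] >= 0, "negative result"
    return result

def save_data(result):
    # placeholder for a DB/file write
    return result

def audit(result):
    return {"status": "ok", **result}

def run_pipeline():
    """One 'fat' task: all helper steps execute inside a single task
    instance, so the scheduler has only one task to poll and schedule."""
    return audit(save_data(validation(actual_processing(fetch_data()))))

# In a DAG file this single callable would back one operator, e.g.
# PythonOperator(task_id="pipeline", python_callable=run_pipeline)
```

Compared with five chained operators, a failure anywhere in the chain re-runs the whole function - which is exactly the trade-off the idempotency exception below is about.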
Exceptions to this rule could be (not an exhaustive list):
- 1. when idempotency dictates that you must break your task into smaller steps to prevent wasteful re-computation / breaking of the workflow, as told in the Using Operators doc:

  An operator represents a single, ideally idempotent, task
- 2. peculiar cases, such as if you are using pools to limit load on an external resource => in this case, each operation that touches that external resource has to be modelled as a separate task in order to enforce the load-restriction via pools
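Conceptually, a pool with N slots caps how many task instances touching the resource may run at once, much like a semaphore. The sketch below is an analogy in plain Python, not Airflow code (in Airflow you would instead pass `pool="some_pool"` to each operator):

```python
import threading
import time

# Toy model of a pool with 2 slots: at most 2 "tasks" that touch the
# external resource can run concurrently.
pool = threading.BoundedSemaphore(2)
lock = threading.Lock()
current = 0
max_concurrent = 0

def task_touching_resource():
    global current, max_concurrent
    with pool:  # acquire one pool slot before touching the resource
        with lock:
            current += 1
            max_concurrent = max(max_concurrent, current)
        time.sleep(0.05)  # simulate work against the external resource
        with lock:
            current -= 1

threads = [threading.Thread(target=task_touching_resource) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# max_concurrent never exceeds the pool size of 2
```

This is why the pool case forces task splitting: the slot limit is enforced per task instance, so work that is bundled into one fat task counts as a single slot no matter how many resource calls it makes.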
s - 在每个心跳(通常大约 20-30 秒),它扫描元数据库和
DagBag
以确定task
的列表准备好运行例如喜欢- a
scheduled
上游任务已经运行的任务 retry_delay
已过期的up_for_retry
任务
- at every heartbeat (usually ~ 20-30 s), it scans meta-db and
DagBag
to determine the list oftask
s that are ready to run for e.g. like- a
scheduled
task who's upstream tasks have been run - an
up_for_retry
task who'sretry_delay
has expired
来自 旧文档
Airflow 调度器监控所有任务和所有 DAG,并触发已满足其依赖关系的任务实例.在...后面场景,它监控所有 DAG 的文件夹并保持同步它可能包含的对象,并定期(每分钟左右)检查活动任务,看是否可以触发.
The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been met. Behind the scenes, it monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) inspects active tasks to see whether they can be triggered.
- 这意味着拥有更多的
task
(以及更多的连接/依赖)会增加调度器的工作量(更多要评估的检查) - this means that having more
task
s (as well as more connections / dependencies between them) will increase the workload of scheduler (more checks to be evaluated)
推荐阅读
对于运行大量快速/小任务的所有这些问题,我们需要快速的分布式任务管理,不需要先前的资源分配(如 Airflow 所做的),因为每个 ETL 任务需要很少的资源,并允许任务在其他立即.
For all these issues with running a massive number of fast/small tasks , we require fast distributed task management, that does not require previous resource allocation (as Airflow does), as each ETL task needs very few resources, and allows tasks to be executed one after the other immediately.
这篇关于气流任务的粒度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
- a
- a
问题 2
假设有足够的资源,气流任务的开销是多少?
What's the overhead of an airflow task assuming there are enough resources?
虽然我无法在这里提供技术上精确的答案,但请了解 Airflow 的调度程序本质上是基于 基于轮询的方法
While I can't provide a technically precise answer here, do understand that Airflow's scheduler essentially works on a poll-based approach
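To make the poll-based behaviour concrete, here is a toy (non-Airflow) sketch of the readiness check one heartbeat performs, using hypothetical task records shaped loosely like what the scheduler reads from the meta-db:

```python
import time

# Hypothetical, simplified task records (not Airflow's actual schema).
tasks = [
    {"id": "t1", "state": "scheduled", "upstream_done": True},
    {"id": "t2", "state": "scheduled", "upstream_done": False},
    {"id": "t3", "state": "up_for_retry", "retry_at": time.time() - 5},   # retry_delay expired
    {"id": "t4", "state": "up_for_retry", "retry_at": time.time() + 60},  # still waiting
]

def ready_to_run(task, now):
    """One heartbeat's check: a scheduled task needs its upstream done;
    an up_for_retry task needs its retry_delay to have expired."""
    if task["state"] == "scheduled":
        return task["upstream_done"]
    if task["state"] == "up_for_retry":
        return now >= task["retry_at"]
    return False

now = time.time()
runnable = [t["id"] for t in tasks if ready_to_run(t, now)]
# Every additional task is one more record to evaluate on every heartbeat.
```

Since this scan repeats every heartbeat, the per-task overhead is small but multiplicative: many lean tasks mean many such checks per cycle, which is the cost the fat-task recommendation above avoids.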