Granularity of tasks in Airflow
Problem description
For one task, there are many helper tasks - fetch/save properties from file/db, validations, audits. These helper methods are not time-consuming.
One sample DAG flow:
fetch_data >> actual_processing >> validation >> save_data >> audit
What's the recommendation in this scenario:
- create a separate task for each helper, or
- club everything into a single task?
What's the overhead of an Airflow task assuming there are enough resources?
Recommended answer
Question-1
What's the recommendation in this scenario?
Always try to keep maximum stuff in a single task (and preferably have fat tasks that run for several minutes rather than lean tasks that run for a few seconds) in order to (not an exhaustive list):

1. minimize scheduling latency
2. minimize load on the scheduler / webserver / SQLAlchemy backend db
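As a sketch of the fat-task approach, assuming the helper steps are plain Python callables, they can be chained inside a single function that backs one operator, so the scheduler tracks one task instance instead of five. The helper names below mirror the sample flow from the question but their bodies are hypothetical placeholders:

```python
# Hypothetical helper steps; in a real DAG these would hit files/DBs.
def fetch_data():
    return {"value": 21}

def actual_processing(data):
    return {"value": data["value"] * 2}

def validation(result):
    assert result["value"] >= 0, "negative result"
    return result

def save_data(result):
    # placeholder for a DB/file write
    return result

def audit(result):
    return {"status": "ok", **result}

def run_pipeline():
    """One 'fat' task: all helper steps execute inside a single task
    instance, so the scheduler has only one task to poll and schedule."""
    return audit(save_data(validation(actual_processing(fetch_data()))))

# In a DAG file this single callable would back one operator, e.g.
# PythonOperator(task_id="pipeline", python_callable=run_pipeline)
```

Compared with five chained operators, a failure anywhere in the chain re-runs the whole function - which is exactly the trade-off the idempotency exception below is about.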
Exceptions to this rule could be (not an exhaustive list):
- 1. when idempotency dictates that you must break your task into smaller steps to prevent wasteful re-computation / breaking of the workflow, as told in the Using Operators doc:

  An operator represents a single, ideally idempotent, task
- 2. peculiar cases, such as if you are using pools to limit load on an external resource => in this case, each operation that touches that external resource has to be modelled as a separate task in order to enforce the load-restriction via pools
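Conceptually, a pool with N slots caps how many task instances touching the resource may run at once, much like a semaphore. The sketch below is an analogy in plain Python, not Airflow code (in Airflow you would instead pass `pool="some_pool"` to each operator):

```python
import threading
import time

# Toy model of a pool with 2 slots: at most 2 "tasks" that touch the
# external resource can run concurrently.
pool = threading.BoundedSemaphore(2)
lock = threading.Lock()
current = 0
max_concurrent = 0

def task_touching_resource():
    global current, max_concurrent
    with pool:  # acquire one pool slot before touching the resource
        with lock:
            current += 1
            max_concurrent = max(max_concurrent, current)
        time.sleep(0.05)  # simulate work against the external resource
        with lock:
            current -= 1

threads = [threading.Thread(target=task_touching_resource) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# max_concurrent never exceeds the pool size of 2
```

This is why the pool case forces task splitting: the slot limit is enforced per task instance, so work that is bundled into one fat task counts as a single slot no matter how many resource calls it makes.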
s - 在每个心跳(通常大约 20-30 秒),它扫描元数据库和
DagBag
以确定task
的列表准备好运行例如喜欢- a
scheduled
上游任务已经运行的任务 retry_delay
已过期的up_for_retry
任务
- at every heartbeat (usually ~ 20-30 s), it scans meta-db and
DagBag
to determine the list oftask
s that are ready to run for e.g. like- a
scheduled
task who's upstream tasks have been run - an
up_for_retry
task who'sretry_delay
has expired
来自 旧文档
Airflow 调度器监控所有任务和所有 DAG,并触发已满足其依赖关系的任务实例.在...后面场景,它监控所有 DAG 的文件夹并保持同步它可能包含的对象,并定期(每分钟左右)检查活动任务,看是否可以触发.
The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been met. Behind the scenes, it monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) inspects active tasks to see whether they can be triggered.
- 这意味着拥有更多的
task
(以及更多的连接/依赖)会增加调度器的工作量(更多要评估的检查) - this means that having more
task
s (as well as more connections / dependencies between them) will increase the workload of scheduler (more checks to be evaluated)
推荐阅读
对于运行大量快速/小任务的所有这些问题,我们需要快速的分布式任务管理,不需要先前的资源分配(如 Airflow 所做的),因为每个 ETL 任务需要很少的资源,并允许任务在其他立即.
For all these issues with running a massive number of fast/small tasks , we require fast distributed task management, that does not require previous resource allocation (as Airflow does), as each ETL task needs very few resources, and allows tasks to be executed one after the other immediately.
这篇关于气流任务的粒度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
- a
- a
问题 2
假设有足够的资源,气流任务的开销是多少?
What's the overhead of an airflow task assuming there are enough resources?
虽然我无法在这里提供技术上精确的答案,但请了解 Airflow 的调度程序本质上是基于 基于轮询的方法
While I can't provide a technically precise answer here, do understand that Airflow's scheduler essentially works on a poll-based approach
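To make the poll-based behaviour concrete, here is a toy (non-Airflow) sketch of the readiness check one heartbeat performs, using hypothetical task records shaped loosely like what the scheduler reads from the meta-db:

```python
import time

# Hypothetical, simplified task records (not Airflow's actual schema).
tasks = [
    {"id": "t1", "state": "scheduled", "upstream_done": True},
    {"id": "t2", "state": "scheduled", "upstream_done": False},
    {"id": "t3", "state": "up_for_retry", "retry_at": time.time() - 5},   # retry_delay expired
    {"id": "t4", "state": "up_for_retry", "retry_at": time.time() + 60},  # still waiting
]

def ready_to_run(task, now):
    """One heartbeat's check: a scheduled task needs its upstream done;
    an up_for_retry task needs its retry_delay to have expired."""
    if task["state"] == "scheduled":
        return task["upstream_done"]
    if task["state"] == "up_for_retry":
        return now >= task["retry_at"]
    return False

now = time.time()
runnable = [t["id"] for t in tasks if ready_to_run(t, now)]
# Every additional task is one more record to evaluate on every heartbeat.
```

Since this scan repeats every heartbeat, the per-task overhead is small but multiplicative: many lean tasks mean many such checks per cycle, which is the cost the fat-task recommendation above avoids.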