Granularity of tasks in airflow


Problem Description

For one task, there are many helper tasks - fetch/save properties from a file/db, validations, audits. These helper methods are not time-consuming.

One sample DAG flow:

fetch_data >> actual_processing >> validation >> save_data >> audit
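
As a minimal sketch, that flow could be written as an Airflow 2.x DAG with one task per step. The callables below are empty placeholders standing in for the question's hypothetical steps, not code from the question:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical placeholders for the steps named in the flow above.
def fetch_data():
    pass  # e.g. fetch properties from a file/db

def actual_processing():
    pass

def validation():
    pass

def save_data():
    pass

def audit():
    pass

with DAG(
    dag_id="sample_flow",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_data", python_callable=fetch_data)
    process = PythonOperator(task_id="actual_processing", python_callable=actual_processing)
    validate = PythonOperator(task_id="validation", python_callable=validation)
    save = PythonOperator(task_id="save_data", python_callable=save_data)
    audit_task = PythonOperator(task_id="audit", python_callable=audit)

    fetch >> process >> validate >> save >> audit_task
```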

What's the recommendation in this scenario:

  • Create a task for each helper task?
  • Club everything into a single task?

What's the overhead of an airflow task assuming there are enough resources?

Answer

Question-1

What's the recommendation in this scenario

Always try to keep maximum stuff in a single task (and preferably have fat tasks that run for several minutes rather than lean tasks that run for a few seconds), in order to (not an exhaustive list):

  • minimize load on the scheduler / webserver / SQLAlchemy backend db
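
As a sketch of this fat-task style (reusing the hypothetical placeholder callables from the DAG sketch above), the helper steps become plain function calls inside one PythonOperator, so the scheduler schedules, queues, and records a single task instance instead of five:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_pipeline():
    # All helper steps (the placeholders from the first sketch) run inside
    # one task, so only this task instance hits the scheduler's bookkeeping.
    fetch_data()
    actual_processing()
    validation()
    save_data()
    audit()

with DAG(
    dag_id="sample_flow_fat",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fat_task = PythonOperator(task_id="run_pipeline", python_callable=run_pipeline)
```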

Exceptions to this rule could be (not an exhaustive list):

  1. when idempotency dictates that you must break your tasks into smaller steps to prevent wasteful re-computation / breaking of the workflow, as told in the Using Operators doc (an idempotency sketch follows this list):

     An operator represents a single, ideally idempotent, task

  2. peculiar cases, such as if you are using pools to limit load on an external resource => in this case, each operation that touches that external resource has to be modelled as a separate task in order to enforce the load-restriction via pools (a pool sketch also follows this list)
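
For illustration of the idempotency point, one common pattern is to key a task's output on the logical date Airflow passes into the callable, so a re-run overwrites the same partition instead of duplicating work. The output path below is a made-up example, not anything from the answer:

```python
from pathlib import Path

def save_data(ds, **kwargs):
    # `ds` is the logical (execution) date Airflow passes to the callable.
    # Writing to a date-keyed path makes a re-run of this task overwrite
    # the same output instead of appending to it (idempotent).
    output_dir = Path(f"/data/output/dt={ds}")  # hypothetical location
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / "result.csv").write_text("...")  # stand-in for the real write
```

And for the pools point, a sketch of capping concurrent access to an external db: the tasks that touch it are each assigned to a shared pool, which must be created beforehand (e.g. via `airflow pools set external_db_pool 2 "cap external db load"`). The pool name and slot count are made up for the example:

```python
from airflow.operators.python import PythonOperator

fetch = PythonOperator(
    task_id="fetch_data",
    python_callable=fetch_data,  # hypothetical helper from the first sketch
    pool="external_db_pool",     # at most 2 pooled task instances run at once
)
save = PythonOperator(
    task_id="save_data",
    python_callable=save_data,
    pool="external_db_pool",
)
```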
Question-2

What's the overhead of an airflow task assuming there are enough resources?

While I can't provide a technically precise answer here, do understand that Airflow's scheduler essentially works on a poll-based approach:

  • at every heartbeat (usually ~ 20-30 s), it scans the meta-db and the DagBag to determine the list of tasks that are ready to run, e.g.
    • a scheduled task whose upstream tasks have been run
    • an up_for_retry task whose retry_delay has expired

From the old docs:

The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been met. Behind the scenes, it monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) inspects active tasks to see whether they can be triggered.

  • this means that having more tasks (as well as more connections / dependencies between them) will increase the workload of the scheduler (more checks to be evaluated)
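
How often this polling happens is configurable. A small sketch, assuming Airflow 2.x and its default values (older versions differ), of reading the relevant `[scheduler]` knobs:

```python
from airflow.configuration import conf

# Both settings live in the [scheduler] section of airflow.cfg;
# the defaults noted here are Airflow 2.x's.
heartbeat = conf.getint("scheduler", "scheduler_heartbeat_sec")    # default: 5
reparse = conf.getint("scheduler", "min_file_process_interval")    # default: 30
print(f"scheduler heartbeats every {heartbeat}s; DAG files re-parsed every {reparse}s")
```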
Recommended read:

For all these issues with running a massive number of fast/small tasks, we require fast distributed task management, that does not require previous resource allocation (as Airflow does), as each ETL task needs very few resources, and allows tasks to be executed one after the other immediately.

