Running airflow tasks/dags in parallel


Problem description

I'm using airflow to orchestrate some python scripts. I have a "main" dag from which several subdags are run. My main dag is supposed to run according to the following overview:

I've managed to get to this structure in my main dag by using the following lines:

etl_internal_sub_dag1 >> etl_internal_sub_dag2 >> etl_internal_sub_dag3
etl_internal_sub_dag3 >> etl_adzuna_sub_dag
etl_internal_sub_dag3 >> etl_adwords_sub_dag
etl_internal_sub_dag3 >> etl_facebook_sub_dag
etl_internal_sub_dag3 >> etl_pagespeed_sub_dag

etl_adzuna_sub_dag >> etl_combine_sub_dag
etl_adwords_sub_dag >> etl_combine_sub_dag
etl_facebook_sub_dag >> etl_combine_sub_dag
etl_pagespeed_sub_dag >> etl_combine_sub_dag
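
For reference, recent Airflow versions also accept a more compact list form of the same dependencies. This is only an equivalent sketch, assuming the task objects above are already defined:

etl_internal_sub_dag1 >> etl_internal_sub_dag2 >> etl_internal_sub_dag3
etl_internal_sub_dag3 >> [etl_adzuna_sub_dag, etl_adwords_sub_dag,
                          etl_facebook_sub_dag, etl_pagespeed_sub_dag] >> etl_combine_sub_dag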

What I want airflow to do is to first run the etl_internal_sub_dag1 then the etl_internal_sub_dag2 and then the etl_internal_sub_dag3. When the etl_internal_sub_dag3 is finished I want etl_adzuna_sub_dag, etl_adwords_sub_dag, etl_facebook_sub_dag, and etl_pagespeed_sub_dag to run in parallel. Finally, when these last four scripts are finished, I want the etl_combine_sub_dag to run.

However, when I run the main dag, etl_adzuna_sub_dag, etl_adwords_sub_dag, etl_facebook_sub_dag, and etl_pagespeed_sub_dag are run one by one and not in parallel.

Question: How do I make sure that the scripts etl_adzuna_sub_dag, etl_adwords_sub_dag, etl_facebook_sub_dag, and etl_pagespeed_sub_dag are run in parallel?

My default_args and DAG look like this:

from datetime import datetime, timedelta

from airflow import DAG

# Placeholder dates for illustration; the original script defines start_date and end_date elsewhere.
start_date = datetime(2019, 1, 1)
end_date = datetime(2019, 12, 31)

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': start_date,
    'end_date': end_date,
    'email': ['myname@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
}

DAG_NAME = 'main_dag'

dag = DAG(DAG_NAME, default_args=default_args, catchup=False)

Answer

You will need to use the LocalExecutor.

Check your configs (airflow.cfg); you might be using the SequentialExecutor, which executes tasks serially.

Airflow uses a backend database to store metadata. Check your airflow.cfg file and look for the executor keyword. By default, Airflow uses the SequentialExecutor, which executes tasks sequentially no matter what. So to allow Airflow to run tasks in parallel, you will need to create a database in Postgres or MySQL, configure it in airflow.cfg (the sql_alchemy_conn parameter), change the executor to LocalExecutor in airflow.cfg, and then run airflow initdb.

Note that to use the LocalExecutor you need Postgres or MySQL instead of SQLite as the backend database.
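
For illustration, the relevant airflow.cfg settings might look like the excerpt below. The connection string is a hypothetical example; replace the user, password, host, and database name with your own.

[core]
# Run tasks in parallel on the local machine
executor = LocalExecutor
# Point the metadata database at Postgres (or MySQL) instead of SQLite
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow

After saving these changes, run airflow initdb against the new database and restart the scheduler and webserver so the new executor takes effect.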

More info: https://airflow.incubator.apache.org/howto/initialize-database.html

If you want to take a real test drive of Airflow, you should consider setting up a real database backend and switching to the LocalExecutor. As Airflow was built to interact with its metadata using the great SqlAlchemy library, you should be able to use any database backend supported as a SqlAlchemy backend. We recommend using MySQL or Postgres.
