DAG (directed acyclic graph) dynamic job scheduler

Question

I need to manage a large workflow of ETL tasks whose execution depends on time, data availability, or an external event. Some jobs may fail while the workflow is running, and the system should be able to restart a failed workflow branch without waiting for the whole workflow to finish.

Are there any frameworks in Python that can handle this?

I see several core functions:

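  • Building the DAG
  • Executing nodes (running shell commands with waiting, logging, etc.)
  • Ability to rebuild a sub-graph in the parent DAG during execution
  • Ability to manually execute nodes or a sub-graph while the parent graph is running
  • Suspending graph execution while waiting for an external event
  • Listing the job queue and job details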
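To make the requirements above concrete, here is a minimal sketch of two of them (running nodes in dependency order and restarting only the failed branch) using nothing but the standard library. It is not taken from any of the frameworks discussed here: `run_dag`, `make_action`, and the task names are hypothetical, and it assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(deps, actions, done=None):
    """Run each task once all of its upstream tasks have succeeded.

    deps    maps task name -> set of upstream task names.
    actions maps task name -> zero-argument callable.
    done    is the set of tasks that already succeeded; they are not re-run.
    Returns (succeeded, failed, skipped) as three sets.
    """
    succeeded = set(done or ())
    failed, skipped = set(), set()
    for task in TopologicalSorter(deps).static_order():
        if task in succeeded:
            continue
        if any(up not in succeeded for up in deps.get(task, ())):
            skipped.add(task)  # an upstream task failed or was itself skipped
            continue
        try:
            actions[task]()
            succeeded.add(task)
        except Exception:
            failed.add(task)
    return succeeded, failed, skipped


# Hypothetical ETL workflow: "transform" fails on its first attempt only.
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"extract"},
}
calls = []
remaining_failures = {"transform": 1}


def make_action(name):
    def action():
        calls.append(name)
        if remaining_failures.get(name, 0) > 0:
            remaining_failures[name] -= 1
            raise RuntimeError(f"{name} failed")
    return action


actions = {name: make_action(name) for name in deps}

first = run_dag(deps, actions)                   # "transform" fails, "load" is skipped
second = run_dag(deps, actions, done=first[0])   # re-runs only the failed branch
```

Passing the set of already-succeeded tasks back in as `done` is what lets the second call skip `extract` and `report`. A real scheduler would persist that state, run independent tasks in parallel, and handle waiting on external events, none of which this sketch attempts.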

Something like Oozie, but more general purpose and in Python.

Recommended answer

1) You can give dagobah a try. As described on its GitHub page: Dagobah is a simple dependency-based job scheduler written in Python. Dagobah allows you to schedule periodic jobs using Cron syntax. Each job then kicks off a series of tasks (subprocesses) in an order defined by a dependency graph you can easily draw with click-and-drag in the web interface. It is the most lightweight scheduler project compared with the three that follow.

2) In terms of ETL tasks, luigi, which was open sourced by Spotify, focuses more on Hadoop jobs. As its description says: Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

Both of these modules are written mainly in Python, and both include a web interface for convenient management.

As far as I know, luigi doesn't provide a scheduler module for job tasks, which I think is necessary for ETL tasks. On the other hand, luigi makes it easier to write map-reduce code in Python, and thousands of tasks at Spotify depend on it every day.

3) Like Spotify with luigi, Pinterest open sourced its own workflow manager, named Pinball. Pinball's architecture follows a master-worker (or master-client, to avoid naming confusion with a special type of client that we introduce below) paradigm where the stateful central master acts as a source of truth about the current system state to stateless clients. It also integrates Hadoop/Hive/Spark jobs smoothly.
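The master-worker idea quoted above can be illustrated with a small in-process sketch. This is not Pinball's API: the `Master` class, the `worker` function, and the task names are hypothetical, threads stand in for separate client processes, and dependencies between tasks are ignored. The point is only that all task state lives in the master while workers keep none of their own.

```python
import threading


class Master:
    """Stateful master: the only place where task state is stored."""

    def __init__(self, tasks):
        self._lock = threading.Lock()
        self._state = {name: "PENDING" for name in tasks}

    def claim(self):
        """Hand out one pending task, or None if nothing is pending."""
        with self._lock:
            for name, state in self._state.items():
                if state == "PENDING":
                    self._state[name] = "RUNNING"
                    return name
            return None

    def report(self, name, ok):
        """Record the outcome of a task that a worker has finished."""
        with self._lock:
            self._state[name] = "DONE" if ok else "FAILED"

    def snapshot(self):
        """Return a copy of the current state of every task."""
        with self._lock:
            return dict(self._state)


def worker(master, run):
    """Stateless worker: keeps asking the master for work until none is left."""
    while (name := master.claim()) is not None:
        try:
            run(name)
            master.report(name, True)
        except Exception:
            master.report(name, False)


def run_task(name):
    # Hypothetical task body: the task called "bad" always fails.
    if name == "bad":
        raise ValueError(name)


master = Master(["a", "b", "bad", "c"])
threads = [threading.Thread(target=worker, args=(master, run_task)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
final = master.snapshot()
```

Because a worker holds no state, one that crashes can simply be replaced, and anything that needs to know what happened (a UI, a retry policy) only has to ask the master.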

4) Airflow, yet another DAG job scheduling project, open sourced by Airbnb, is quite similar to Luigi and Pinball. The backend is built on Flask, Celery and so on. Judging from the example job code, Airflow looks both powerful and easy to use to me.

Last but not least, Luigi, Airflow and Pinball may be the more widely used ones. There is a great comparison of these three: http://bytepawn.com/luigi-airflow-pinball.html
