Suggestion for scheduling tool(s) for building Hadoop-based data pipelines


Question

Between Apache Oozie, Spotify/Luigi and airbnb/airflow, what are the pros and cons of each?

I have used Oozie and Airflow in the past to build a data ingestion pipeline using Pig and Hive. Currently, I am building a pipeline that reads logs, extracts useful events, and loads them into Redshift.

I found Airflow much easier to use, test, and set up. It has a much cooler UI and lets users perform actions from the UI itself, which is not the case with Oozie. Any information about Luigi, or other insights regarding stability and issues, is welcome.

Answer

• Azkaban: Nice UI, relatively simple, accessible for non-programmers. Has a longish history at LinkedIn.
• Airflow: Decent UI, Python-ish job definition, semi-accessible for non-programmers, dependency declaration syntax is weird.
• Luigi: OK UI, workflows are pure Python, requires solid grasp of Python coding and object oriented concepts, hence not suitable for non-programmers.
• Oozie: Insane XML based job definitions. Here be dragons. ;-)
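
To make the "pure Python, object oriented" point concrete, here is a toy sketch of what class-based workflow definitions tend to look like. It uses only the standard library and is not Luigi's (or any other tool's) actual API: `Task`, `requires`, `run`, `build`, and the two example task classes are hypothetical names chosen for illustration.

```python
class Task:
    """Hypothetical base class: subclasses declare dependencies and work."""

    def requires(self):
        # Upstream tasks that must finish first (none by default).
        return []

    def run(self, log):
        raise NotImplementedError


class ExtractEvents(Task):
    def run(self, log):
        log.append("extract")


class LoadEvents(Task):
    def requires(self):
        # Loading depends on extraction having run.
        return [ExtractEvents()]

    def run(self, log):
        log.append("load")


def build(task, log, done=None):
    """Run `task` after its dependencies, depth-first, each task class once."""
    done = set() if done is None else done
    if type(task) in done:
        return log
    for dep in task.requires():
        build(dep, log, done)
    task.run(log)
    done.add(type(task))
    return log


if __name__ == "__main__":
    print(build(LoadEvents(), []))
```

Even in this miniature form, defining a job means subclassing, overriding methods, and reasoning about object graphs, which is the barrier for non-programmers mentioned above. The sketch does no cycle detection, scheduling, or retrying.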

IMHO, Azkaban enforces simplicity (you can’t use features that don’t exist), while the others subtly encourage complexity.

Simpler pipelines are better than complex ones: they are easier to create, easier to understand (especially when you didn’t create them), and easier to debug and fix.

When complex actions are needed, you want to encapsulate them so that they either completely succeed or completely fail.
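
One common way to get that all-or-nothing behavior for a step that produces a file is to write to a temporary file and rename it into place only once the write has finished. The following is a minimal sketch under that assumption; `write_atomically` is a hypothetical helper name, and the rename is atomic on POSIX filesystems when source and target are on the same filesystem.

```python
import os
import tempfile


def write_atomically(path, lines):
    """Write `lines` to `path` so readers see the old file or the full new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
        os.replace(tmp_path, path)  # single rename publishes the complete file
    except BaseException:
        # On any failure, discard the partial temp file and leave `path` untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the step fails halfway, downstream jobs never see a half-written output, so the step can simply be rerun.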

If you can make them idempotent (running them again produces identical results), that’s even better.
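
As an illustration of idempotence, the sketch below loads events keyed by a unique id, so retrying the same batch leaves the table exactly as it was. It uses SQLite from the standard library purely as a stand-in; the `events` table, its columns, and the `load_events` and `snapshot` helpers are assumptions made for the example, and a warehouse such as Redshift would need its own equivalent of the upsert.

```python
import sqlite3


def load_events(conn, events):
    """Upsert events by primary key so reruns leave the table unchanged."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (id TEXT PRIMARY KEY, payload TEXT)"
    )
    # INSERT OR REPLACE overwrites an existing row with the same id
    # instead of adding a duplicate.
    conn.executemany(
        "INSERT OR REPLACE INTO events (id, payload) VALUES (?, ?)", events
    )
    conn.commit()


def snapshot(conn):
    """Return the table contents in a stable order for comparison."""
    return conn.execute("SELECT id, payload FROM events ORDER BY id").fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    batch = [("e1", "login"), ("e2", "purchase")]
    load_events(conn, batch)
    first = snapshot(conn)
    load_events(conn, batch)  # simulate a retry of the same batch
    print(first == snapshot(conn))
```

A plain append-only insert would double the rows on the retry; keying on a stable id is what makes the rerun safe.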
