Apache Airflow or Apache Beam for data processing and job scheduling


Question

I'm trying to give useful information, but I am far from being a data engineer.

I am currently using the Python library pandas to execute a long series of transformations on my data, which has many inputs (currently CSV and Excel files). The outputs are several Excel files. I would like to be able to execute scheduled, monitored batch jobs with parallel computation (i.e. not as sequential as what I'm doing with pandas), once a month.

I don't really know Beam or Airflow; I quickly read through the docs, and it seems that both can achieve that. Which one should I use?

Answer

Apache Airflow is not a data processing engine.

Airflow is a platform to programmatically author, schedule, and monitor workflows.

Cloud Dataflow is a fully managed service on Google Cloud that can be used for data processing. You can write your Dataflow code and then use Airflow to schedule and monitor the Dataflow job. Airflow also allows you to retry the job if it fails (the number of retries is configurable). You can also configure Airflow to send alerts via Slack or email if your Dataflow pipeline fails.
