Using Dataflow vs. Cloud Composer


Question

I'd like to get some clarification on whether Cloud Dataflow or Cloud Composer is the right tool for the job; it wasn't clear to me from the Google documentation.

Currently, I'm using Cloud Dataflow to read a non-standard CSV file, do some basic processing, and load it into BigQuery.

Let me give a very basic example:

# file.csv
type\x01date
house\x0112/27/1982
car\x0111/9/1889

From this file we detect the schema and create a BigQuery table, something like this:

`table`
type (STRING)
date (DATE)
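
For illustration, here is a minimal sketch of how such a table could be created with the google-cloud-bigquery Python client; the project, dataset, and table names are made-up placeholders, not the asker's actual setup.

from google.cloud import bigquery

# Hypothetical project/dataset/table names, used purely for illustration.
client = bigquery.Client(project="my-project")
schema = [
    bigquery.SchemaField("type", "STRING"),
    bigquery.SchemaField("date", "DATE"),
]
table = bigquery.Table("my-project.my_dataset.my_table", schema=schema)
client.create_table(table)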

And we also format our data (in Python) to insert into BigQuery:

DATA = [
    ("house", "1982-12-27"),
    ("car", "1889-11-09")
]

This is a vast simplification of what's going on, but this is how we're currently using Cloud Dataflow.
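
For context, a pipeline like the one described might look roughly like the sketch below, using the Apache Beam Python SDK (which is what Cloud Dataflow runs). The bucket, project, and table names are hypothetical, and the parsing logic is only a guess at what the "basic processing" involves.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_line(line):
    # Split on the \x01 delimiter and convert MM/DD/YYYY to YYYY-MM-DD.
    record_type, raw_date = line.split("\x01")
    month, day, year = raw_date.split("/")
    return {"type": record_type, "date": f"{year}-{int(month):02d}-{int(day):02d}"}

# Pass --runner=DataflowRunner, --project, --region, etc. on the command line.
options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/file.csv", skip_header_lines=1)
        | "Parse" >> beam.Map(parse_line)
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="type:STRING,date:DATE",
        )
    )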

My question, then, is: where does Cloud Composer come into the picture? What additional features could it provide on top of the above? In other words, why would it be used "on top of" Cloud Dataflow?

Answer

Cloud Composer (which is backed by Apache Airflow) is designed for small-scale task scheduling.

Here is an example to help you understand:

Say you have a CSV file in GCS, and, using your example, say you use Cloud Dataflow to process it and insert the formatted data into BigQuery. If this is a one-off thing, you have just finished it and it's perfect.

Now let's say your CSV file is overwritten at 01:00 UTC every day, and you want to run the same Dataflow job to process it every time it's overwritten. If you don't want to manually run the job at exactly 01:00 UTC regardless of weekends and holidays, you need something to run the job for you periodically (in our example, at 01:00 UTC every day). Cloud Composer can help you in this case. You provide Cloud Composer with a configuration that includes which jobs to run (operators), when to run them (a job start time), and at what frequency (daily, weekly, or even yearly).
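
As a rough sketch of what that configuration could look like (assuming the Google provider package for Airflow and a Dataflow template at a made-up GCS path), a DAG scheduled at 01:00 UTC daily might be written like this:

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

with DAG(
    dag_id="daily_csv_to_bigquery",
    schedule_interval="0 1 * * *",   # every day at 01:00 UTC
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    run_dataflow = DataflowTemplatedJobStartOperator(
        task_id="run_csv_pipeline",
        template="gs://my-bucket/templates/csv_to_bq",   # hypothetical template path
        location="us-central1",
        parameters={"input": "gs://my-bucket/file.csv"},
    )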

That already seems cool. However, what if the CSV file is overwritten not at 01:00 UTC but at an arbitrary time of day, so there is no fixed daily running time to choose? Cloud Composer provides sensors, which can monitor a condition (in this case, the CSV file's modification time). Cloud Composer can guarantee that it kicks off a job only when the condition is satisfied.
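
A hedged sketch of that pattern, using the GCS object-update sensor from Airflow's Google provider package (bucket and object names are placeholders), could look like this; the sensor would then be chained in front of the Dataflow task from the previous sketch:

from airflow.providers.google.cloud.sensors.gcs import GCSObjectUpdateSensor

wait_for_new_csv = GCSObjectUpdateSensor(
    task_id="wait_for_new_csv",
    bucket="my-bucket",        # hypothetical bucket
    object="file.csv",         # the CSV that gets overwritten
    poke_interval=300,         # check every 5 minutes
)

# wait_for_new_csv >> run_dataflow   # only run the pipeline once the file has changed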

Cloud Composer / Apache Airflow provides many more features, including DAGs that run multiple jobs, retries for failed tasks, failure notifications, and a nice dashboard. You can also learn more from their documentation.
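
To make those features concrete, here is a small, self-contained sketch of a multi-task DAG with retries and failure e-mails; the commands and addresses are placeholders rather than recommendations:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 3,                          # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=10),
    "email_on_failure": True,
    "email": ["data-team@example.com"],    # placeholder address
}

with DAG(
    dag_id="multi_step_example",
    default_args=default_args,
    schedule_interval="0 1 * * *",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")
    notify = BashOperator(task_id="notify", bash_command="echo done")
    extract >> load >> notify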
