Complex join with Google Dataflow


Question

I'm a newbie trying to understand how we might rewrite a batch ETL process to run on Google Dataflow. I've read some of the docs and run a few examples.

I'm proposing that the new ETL process would be driven by business events (i.e. a source PCollection). These would trigger the ETL process for that particular business entity. The ETL process would extract datasets from source systems and then pass those results (PCollections) on to the next processing stage. The processing stages would involve various types of joins, including Cartesian products and non-key joins (e.g., date-banded joins).

A couple of questions:

(1) Is the approach that I'm proposing valid and efficient? If not, what would be better? I haven't seen any presentations on real-world, complex ETL processes using Google Dataflow, only simple scenarios.

Are there any "higher-level" ETL products that are a better fit? I've been keeping an eye on Spark and Flink for a while.

Our current ETL is moderately complex, though there are only about 30 core tables (classic EDW dimensions and facts), and ~1000 transformation steps. Source data is complex (roughly 150 Oracle tables).

(2) How would the complex non-key joins be handled?

I'm obviously attracted to Google Dataflow because it is, first and foremost, an API, and the parallel processing capabilities seem a very good fit (we are being asked to move from overnight batch runs to incremental processing).

A good worked example of Dataflow for this use case would really push adoption forward!

Thanks, Mike S

Answer

It sounds like Dataflow would be a good fit. We allow you to write a pipeline that takes a PCollection of business events and performs the ETL. The pipeline could either be batch (executed periodically) or streaming (executed whenever input data arrives).
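
For concreteness, here is a minimal pipeline skeleton using the Apache Beam Java SDK (the open-source successor to the original Dataflow SDK); the bucket path, file format, and the EtlPipeline class name are hypothetical placeholders:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class EtlPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Each line stands in for one business event. A streaming variant
    // would read from an unbounded source such as Pub/Sub instead.
    PCollection<String> events =
        p.apply(TextIO.read().from("gs://my-bucket/events/*.json"));

    // ... parse events, extract the source datasets they refer to,
    // and apply the joins discussed below ...

    p.run().waitUntilFinish();
  }
}
```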

The various joins are, for the most part, relatively expressible in Dataflow. For the Cartesian product, you can look at using side inputs to make the contents of one PCollection available as an input to the processing of each element of another PCollection.
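
A minimal sketch of that side-input pattern, assuming hypothetical Order and Rate element types and a rates PCollection small enough to fit in worker memory:

```java
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// Materialize the (small) rates PCollection as a List side input.
PCollectionView<List<Rate>> ratesView = rates.apply(View.asList());

// Pair every order with every rate: a Cartesian product.
PCollection<KV<Order, Rate>> product = orders.apply(
    ParDo.of(new DoFn<Order, KV<Order, Rate>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        for (Rate rate : c.sideInput(ratesView)) {
          c.output(KV.of(c.element(), rate));
        }
      }
    }).withSideInputs(ratesView));
```

Note that this only scales when one side of the product is small; a Cartesian product of two large PCollections is expensive no matter how it is expressed.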

You can also look at using GroupByKey or CoGroupByKey to implement the joins. These group multiple keyed inputs and allow accessing all values with the same key in one place. You can also use Combine.perKey to compute associative and commutative combinations of all the elements associated with a key (e.g., SUM, MIN, MAX, AVERAGE, etc.).
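
As a sketch, a CoGroupByKey join of two keyed PCollections might look like this (Customer, Order, and the customersById/ordersByCustomerId collection names are hypothetical):

```java
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

// Tags distinguish the two inputs inside the grouped result.
final TupleTag<Customer> customerTag = new TupleTag<>();
final TupleTag<Order> orderTag = new TupleTag<>();

// Both inputs are keyed by the same join key (customer id here).
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(customerTag, customersById)
        .and(orderTag, ordersByCustomerId)
        .apply(CoGroupByKey.create());

// A downstream DoFn can then read both sides for each key:
//   Iterable<Customer> cs = c.element().getValue().getAll(customerTag);
//   Iterable<Order> os = c.element().getValue().getAll(orderTag);
```

For purely combinable aggregates, Combine.perKey (e.g., Combine.perKey(Sum.ofDoubles())) is the shortcut, and it avoids materializing the full per-key iterable.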

Date-banded joins sound like a good fit for windowing, which allows you to write a pipeline that consumes windows of data (e.g., hourly windows, daily windows, 7-day windows that slide every day, etc.).
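
A rough sketch of the windowing piece, reusing the hypothetical ordersByKey collection from above (its elements are assumed to carry event timestamps):

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Assign each (timestamped) element to a daily window. Grouping
// transforms applied afterwards (GroupByKey, CoGroupByKey, Combine)
// then operate per window, approximating a date-banded join.
PCollection<KV<String, Order>> daily =
    ordersByKey.apply(Window.<KV<String, Order>>into(
        FixedWindows.of(Duration.standardDays(1))));

// A 7-day window sliding every day, as mentioned above:
PCollection<KV<String, Order>> weekly =
    ordersByKey.apply(Window.<KV<String, Order>>into(
        SlidingWindows.of(Duration.standardDays(7))
            .every(Duration.standardDays(1))));
```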
