Complex join with Google Dataflow


Question

I'm a newbie trying to understand how we might rewrite a batch ETL process to run on Google Dataflow. I've read some of the docs and run a few examples.

I'm proposing that the new ETL process would be driven by business events (i.e. a source PCollection). These would trigger the ETL process for that particular business entity. The ETL process would extract datasets from source systems and then pass those results (PCollections) on to the next processing stage. The processing stages would involve various types of joins, including Cartesian products and non-key joins (e.g., date-banded joins).

A couple of questions:

(1) Is the approach that I'm proposing valid and efficient? If not, what would be better? I haven't seen any presentations on real-world, complex ETL processes using Google Dataflow, only simple scenarios.

Are there any "higher-level" ETL products that are a better fit? I've been keeping an eye on Spark and Flink for a while.

Our current ETL is moderately complex, though there are only about 30 core tables (classic EDW dimensions and facts), and ~1000 transformation steps. Source data is complex (roughly 150 Oracle tables).

(2) How would the complex non-key joins be handled?

I'm obviously attracted to Google Dataflow because it is, first and foremost, an API, and the parallel processing capabilities seem a very good fit (we are being asked to move from overnight batch runs to incremental processing).

A good worked example of Dataflow for this use case would really push adoption forward!

Thanks, Mike S

Answer

It sounds like Dataflow would be a good fit. We allow you to write a pipeline that takes a PCollection of business events and performs the ETL. The pipeline could either be batch (executed periodically) or streaming (executed whenever input data arrives).
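
For concreteness, here is a minimal pipeline skeleton using the Apache Beam Java SDK (the open-source successor to the original Dataflow SDK); the bucket path, file format, and the EtlPipeline class name are hypothetical placeholders:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class EtlPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Each line stands in for one business event. A streaming variant
    // would read from an unbounded source such as Pub/Sub instead.
    PCollection<String> events =
        p.apply(TextIO.read().from("gs://my-bucket/events/*.json"));

    // ... parse events, extract the source datasets they refer to,
    // and apply the joins discussed below ...

    p.run().waitUntilFinish();
  }
}
```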

The various joins are, for the most part, relatively expressible in Dataflow. For the Cartesian product, you can look at using side inputs to make the contents of one PCollection available as an input to the processing of each element of another PCollection.
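
A minimal sketch of that side-input pattern, assuming hypothetical Order and Rate element types and a rates PCollection small enough to fit in worker memory:

```java
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// Materialize the (small) rates PCollection as a List side input.
PCollectionView<List<Rate>> ratesView = rates.apply(View.asList());

// Pair every order with every rate: a Cartesian product.
PCollection<KV<Order, Rate>> product = orders.apply(
    ParDo.of(new DoFn<Order, KV<Order, Rate>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        for (Rate rate : c.sideInput(ratesView)) {
          c.output(KV.of(c.element(), rate));
        }
      }
    }).withSideInputs(ratesView));
```

Note that this only scales when one side of the product is small; a Cartesian product of two large PCollections is expensive no matter how it is expressed.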

You can also look at using GroupByKey or CoGroupByKey to implement the joins. These group multiple keyed inputs and allow accessing all values with the same key in one place. You can also use Combine.perKey to compute associative and commutative combinations of all the elements associated with a key (e.g., SUM, MIN, MAX, AVERAGE, etc.).
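
As a sketch, a CoGroupByKey join of two keyed PCollections might look like this (Customer, Order, and the customersById/ordersByCustomerId collection names are hypothetical):

```java
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

// Tags distinguish the two inputs inside the grouped result.
final TupleTag<Customer> customerTag = new TupleTag<>();
final TupleTag<Order> orderTag = new TupleTag<>();

// Both inputs are keyed by the same join key (customer id here).
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(customerTag, customersById)
        .and(orderTag, ordersByCustomerId)
        .apply(CoGroupByKey.create());

// A downstream DoFn can then read both sides for each key:
//   Iterable<Customer> cs = c.element().getValue().getAll(customerTag);
//   Iterable<Order> os = c.element().getValue().getAll(orderTag);
```

For purely combinable aggregates, Combine.perKey (e.g., Combine.perKey(Sum.ofDoubles())) is the shortcut, and it avoids materializing the full per-key iterable.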

Date-banded joins sound like a good fit for windowing, which allows you to write a pipeline that consumes windows of data (e.g., hourly windows, daily windows, 7-day windows that slide every day, etc.).
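
A rough sketch of the windowing piece, reusing the hypothetical ordersByKey collection from above (its elements are assumed to carry event timestamps):

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Assign each (timestamped) element to a daily window. Grouping
// transforms applied afterwards (GroupByKey, CoGroupByKey, Combine)
// then operate per window, approximating a date-banded join.
PCollection<KV<String, Order>> daily =
    ordersByKey.apply(Window.<KV<String, Order>>into(
        FixedWindows.of(Duration.standardDays(1))));

// A 7-day window sliding every day, as mentioned above:
PCollection<KV<String, Order>> weekly =
    ordersByKey.apply(Window.<KV<String, Order>>into(
        SlidingWindows.of(Duration.standardDays(7))
            .every(Duration.standardDays(1))));
```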
