Google Cloud Dataflow: consuming an external source

Views: 93 | Published: 2020/11/18 1:46:43 | Tags: python, etl, google-cloud-dataflow

Question

I am having a bit of an issue with the concepts behind Dataflow, especially with the way pipelines are supposed to be structured.

I am trying to consume an external API that delivers an index XML file with links to separate XML files. Once I have the contents of all the XML files, I need to split them into separate PCollections so that additional PTransforms can be applied.

I find it hard to wrap my head around the fact that the first XML file has to be downloaded and read before the product XML files can be downloaded and read, because the documentation states that a pipeline starts with a Source and ends with a Sink.

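So my questions are:

Is Dataflow even the right tool for this kind of task?
Is a custom Source meant to incorporate this whole process, or should it be done in separate steps/pipelines?
Can this be handled in one pipeline, with another pipeline then reading the files?
What would a high-level overview of this process look like?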

Things to note: I am using the Python SDK for this, but that probably isn't really relevant, as this is more of an architectural problem.
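
To make the two-stage dependency concrete (index first, then the linked files), here is a minimal sketch of how it might be expressed as chained transforms in the Python SDK. It is an illustration under stated assumptions, not a confirmed solution: the URL, the `<link>` element name, the root-tag-based split, and the helper names `fetch`, `extract_links`, `by_root_tag` and `build_pipeline` are all hypothetical, since the question does not describe the API or the XML layout.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical values: the question gives neither the API URL nor the XML layout.
INDEX_URL = "https://example.com/index.xml"
KINDS = ["product", "unknown"]  # last entry doubles as the catch-all output


def fetch(url):
    """Download one document and return its body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def extract_links(index_xml):
    """Yield the URL held in every <link> element of the index (assumed layout)."""
    root = ET.fromstring(index_xml)
    for link in root.iter("link"):
        if link.text:
            yield link.text.strip()


def by_root_tag(document_xml, num_partitions):
    """Partition function: pick an output index from the document's root tag."""
    tag = ET.fromstring(document_xml).tag
    return KINDS.index(tag) if tag in KINDS else num_partitions - 1


def build_pipeline(pipeline):
    """Attach the two-stage fetch to an existing beam.Pipeline."""
    import apache_beam as beam  # imported here so the helpers work without Beam

    documents = (
        pipeline
        | "IndexUrl" >> beam.Create([INDEX_URL])
        | "FetchIndex" >> beam.Map(fetch)
        | "ExtractLinks" >> beam.FlatMap(extract_links)
        | "FetchFiles" >> beam.Map(fetch)
    )
    # One PCollection per entry in KINDS, ready for further PTransforms.
    return documents | "SplitByKind" >> beam.Partition(by_root_tag, len(KINDS))
```

In this sketch the pipeline starts from a single in-memory element (the index URL) instead of a custom Source, and the downloads happen inside ordinary `Map`/`FlatMap` steps. Calling `build_pipeline(p)` inside `with beam.Pipeline() as p:` would return a tuple with one PCollection per entry in `KINDS`.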
