仅在另一个管道完成谷歌数据流后才执行管道 [英] Executing a pipeline only after another one finishes on google dataflow

查看:28
本文介绍了仅在另一个管道完成谷歌数据流后才执行管道的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想在谷歌数据流上运行一个依赖于另一个管道输出的管道.现在我只是在本地使用 DirectRunner 一个接一个地运行两个管道:

I want to run a pipeline on google dataflow that depends on the output of another pipeline. Right now I am just running two pipelines after each other with the DirectRunner locally:

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | beam.io.ReadFromText(known_args.input)
     | SomeTransform()
     | beam.io.WriteToText('temp'))

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | beam.io.ReadFromText('temp*')
     | AnotherTransform()
     | beam.io.WriteToText(known_args.output))

我的问题如下:

  • DataflowRunner 是否保证第二个管道仅在第一个管道完成后启动?
  • 是否有一种首选方法可以依次运行两个管道?
  • 还有什么推荐的方法可以将这些管道分离到不同的文件中,以便更好地测试它们吗?

推荐答案

DataflowRunner 是否保证在第一个管道完成后才启动第二个?

Does the DataflowRunner guarantee that the second is only started after the first pipeline completed?

不,Dataflow 只是执行管道.它没有管理相关管道执行的功能.

No, Dataflow simply executes a pipeline. It has no features for managing dependent pipeline executions.

更新:为了澄清起见,Apache Beam 确实提供了一种等待管道完成执行的机制.请参阅 PipelineResult 类的 waitUntilFinish() 方法.参考:PipelineResult.waitUntilFinish().

Update: For clarification, Apache Beam does provide a mechanism for waiting for a pipeline to complete execution. See the waitUntilFinish() method of the PipelineResult class. Reference: PipelineResult.waitUntilFinish().

是否有一种首选方法可以依次运行两个管道?

Is there a preferred method to run two pipelines consequetively after each other?

考虑使用 Apache Airflow 之类的工具来管理相关管道.您甚至可以实现一个简单的 bash 脚本,在另一个管道完成后部署另一个管道.

Consider using a tool like Apache Airflow for managing dependent pipelines. You can even implement a simple bash script to deploy one pipeline after another one has finished.

还有推荐的方法将这些管道分成不同的文件以便更好地测试它们吗?

Also is there a recommended way to separate those pipelines into different files in order to test them better?

是的,单独的文件.这只是良好的代码组织,不一定更适合测试.

Yes, separate files. This is just good code organization, not necessarily better for testing.

这篇关于仅在另一个管道完成谷歌数据流后才执行管道的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆