Executing a pipeline only after another one finishes on Google Dataflow


Problem description

I want to run a pipeline on Google Dataflow that depends on the output of another pipeline. Right now I am just running two pipelines after each other with the DirectRunner locally:

import apache_beam as beam

# First pipeline: transform the input and write intermediate files.
with beam.Pipeline(options=pipeline_options) as p:
    (p
     | beam.io.ReadFromText(known_args.input)
     | SomeTransform()
     | beam.io.WriteToText('temp'))

# Second pipeline: read the intermediate files and write the final output.
with beam.Pipeline(options=pipeline_options) as p:
    (p
     | beam.io.ReadFromText('temp*')
     | AnotherTransform()
     | beam.io.WriteToText(known_args.output))

My questions are the following:

  • Does the DataflowRunner guarantee that the second pipeline is only started after the first one has completed?
  • Is there a preferred method for running two pipelines consecutively, one after the other?
  • Also, is there a recommended way to separate those pipelines into different files in order to test them better?

Recommended answer

Does the DataflowRunner guarantee that the second pipeline is only started after the first one has completed?

No, Dataflow simply executes a pipeline. It has no features for managing dependent pipeline executions.

Update: For clarification, Apache Beam does provide a mechanism for waiting for a pipeline to complete execution. See the waitUntilFinish() method of the PipelineResult class. Reference: PipelineResult.waitUntilFinish().
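In the Python SDK that the question uses, the corresponding method is wait_until_finish() on the result returned by Pipeline.run(). A minimal sketch of chaining the two pipelines this way, reusing the question's pipeline_options, known_args, SomeTransform and AnotherTransform:

import apache_beam as beam

# Run the first pipeline and block until it has finished.
p1 = beam.Pipeline(options=pipeline_options)
(p1
 | beam.io.ReadFromText(known_args.input)
 | SomeTransform()
 | beam.io.WriteToText('temp'))
p1.run().wait_until_finish()

# Only now build and run the second pipeline, which reads the first one's output.
p2 = beam.Pipeline(options=pipeline_options)
(p2
 | beam.io.ReadFromText('temp*')
 | AnotherTransform()
 | beam.io.WriteToText(known_args.output))
p2.run().wait_until_finish()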

Is there a preferred method to run two pipelines consecutively, one after the other?

Consider using a tool like Apache Airflow for managing dependent pipelines. You can even implement a simple bash script to deploy one pipeline after another one has finished.
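For the Airflow route, a minimal DAG sketch is shown below. It assumes the two pipelines have been split into hypothetical scripts pipeline_one.py and pipeline_two.py, and uses the Airflow 2.x BashOperator to submit them in order:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='dependent_dataflow_pipelines',
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # trigger manually
    catchup=False,
) as dag:
    run_first = BashOperator(
        task_id='run_first_pipeline',
        bash_command='python pipeline_one.py --runner DataflowRunner',
    )
    run_second = BashOperator(
        task_id='run_second_pipeline',
        bash_command='python pipeline_two.py --runner DataflowRunner',
    )
    # The second job is submitted only after the first task succeeds.
    run_first >> run_second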

Also, is there a recommended way to separate those pipelines into different files in order to test them better?

Yes, use separate files. This is simply good code organization; it is not necessarily better for testing on its own.
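For the testing part, the Beam Python SDK ships helpers such as TestPipeline and assert_that. A minimal sketch, assuming the shared transform is moved into a hypothetical transforms.py module and exercised from a separate test file:

# transforms.py (hypothetical shared module)
import apache_beam as beam

class SomeTransform(beam.PTransform):
    def expand(self, pcoll):
        # Placeholder logic; the real transform from the question goes here.
        return pcoll | beam.Map(str.upper)

# test_transforms.py
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

def test_some_transform():
    with TestPipeline() as p:
        output = (p
                  | beam.Create(['a', 'b'])
                  | SomeTransform())
        assert_that(output, equal_to(['A', 'B']))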
