Apache Beam:具有无限源的批处理管道 [英] Apache Beam: Batch Pipeline with Unbounded Source

查看:76
本文介绍了Apache Beam:具有无限源的批处理管道的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在将Apache Beam与Google Dataflow结合使用来处理实时数据.数据来自无限制的Google PubSub,因此当前我正在使用流式传输管道.然而,事实证明,具有24/7/24运行的流传输管道非常昂贵.为了降低成本,我正在考虑切换到以固定时间间隔(例如每30分钟)运行的批处理管道,因为对于用户而言,实时处理并不十分重要.

我想知道是否可以使用PubSub订阅作为有界源?我的想法是,每次运行作业时,它将在触发之前累积1分钟的数据.到目前为止,这似乎是不可能的,但是我遇到了一个名为

我尝试执行以下操作,但作业仍以流模式运行:

PCollection<MyData> data = pipeline
            .apply("ReadData", PubsubIO
                    .readMessagesWithAttributes()
                    .fromSubscription(options.getInput()))
            .apply("ParseData", ParDo.of(new ParseMyDataFn()))

            // Is there a way to make the window trigger once and turning it into a bounded source?
            .apply("Window", Window
                    .<MyData>into(new GlobalWindows())
                    .triggering(AfterProcessingTime
                        .pastFirstElementInPane()
                        .plusDelayOf(Duration.standardMinutes(1))
                    )
                    .withAllowedLateness(Duration.ZERO).discardingFiredPanes()
            );

当前在PubsubIO中没有明确支持,但是您可以尝试定期启动流作业,并在几分钟后以编程方式对其调用Drain. >

I'm currently using Apache Beam with Google Dataflow for processing real time data. The data comes from Google PubSub, which is unbounded, so currently I'm using streaming pipeline. However, it turns out that having a streaming pipeline running 24/7 is quite expensive. To reduce cost, I'm thinking of switching to a batch pipeline that runs at a fixed time interval (e.g. every 30 minutes), since it's not really important for the processing to be real time for the user.

I'm wondering if it's possible to use PubSub subscription as a bounded source? My idea is that each time the job is run, it will accumulate the data for a 1 minute before triggering. So far it does not seem possible, but I've come across a class called BoundedReadFromUnboundedSource (which I have no idea how to use), so maybe there is a way?

Below is roughly how the source looks like:

PCollection<MyData> data = pipeline
            .apply("ReadData", PubsubIO
                    .readMessagesWithAttributes()
                    .fromSubscription(options.getInput()))
            .apply("ParseData", ParDo.of(new ParseMyDataFn()))
            .apply("Window", Window
                    .<MyData>into(new GlobalWindows())
                    .triggering(Repeatedly
                            .forever(AfterProcessingTime
                                    .pastFirstElementInPane()
                                    .plusDelayOf(Duration.standardSeconds(5))
                            )
                    )
                    .withAllowedLateness(Duration.ZERO).discardingFiredPanes()
            );

I tried to do the following, but the job still runs in streaming mode:

PCollection<MyData> data = pipeline
            .apply("ReadData", PubsubIO
                    .readMessagesWithAttributes()
                    .fromSubscription(options.getInput()))
            .apply("ParseData", ParDo.of(new ParseMyDataFn()))

            // Is there a way to make the window trigger once and turning it into a bounded source?
            .apply("Window", Window
                    .<MyData>into(new GlobalWindows())
                    .triggering(AfterProcessingTime
                        .pastFirstElementInPane()
                        .plusDelayOf(Duration.standardMinutes(1))
                    )
                    .withAllowedLateness(Duration.ZERO).discardingFiredPanes()
            );

解决方案

This is not explicitly supported in PubsubIO currently, however you could try periodically starting a streaming job and programmatically invoking Drain on it a few minutes later.

这篇关于Apache Beam:具有无限源的批处理管道的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆