Consuming unbounded data in windows with default trigger

Problem description

I have a Pub/Sub topic + subscription and want to consume and aggregate the unbounded data from the subscription in a Dataflow pipeline. I use a fixed window and write the aggregates to BigQuery.

Reading and writing (without windowing and aggregation) works fine. But when I pipe the data into a fixed window (to count the elements in each window), the window never triggers, and thus the aggregates are not written.

Here is my word publisher (it uses kinglear.txt from the examples as the input file):

public static class AddCurrentTimestampFn extends DoFn<String, String> {
    // Stamps every element with the current processing time as its event timestamp.
    @ProcessElement public void processElement(ProcessContext c) {
        c.outputWithTimestamp(c.element(), new Instant(System.currentTimeMillis()));
    }
}

public static class ExtractWordsFn extends DoFn<String, String> {
    // Splits each input line into words, dropping empty tokens.
    @ProcessElement public void processElement(ProcessContext c) {
        String[] words = c.element().split("[^a-zA-Z']+");
        for (String word : words) { if (!word.isEmpty()) { c.output(word); } }
    }
}

// main:
Pipeline p = Pipeline.create(o); // 'o' are the pipeline options
p.apply("ReadLines", TextIO.Read.from(o.getInputFile()))
        .apply("Lines2Words", ParDo.of(new ExtractWordsFn()))
        .apply("AddTimestampFn", ParDo.of(new AddCurrentTimestampFn()))
        .apply("WriteTopic", PubsubIO.Write.topic(o.getTopic()));
p.run();

Here is my windowed word counter:

Pipeline p = Pipeline.create(o); // 'o' are the pipeline options

BigQueryIO.Write.Bound tablePipe = BigQueryIO.Write.to(o.getTable(o))
        .withSchema(o.getSchema())
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);

Window.Bound<String> w = Window
        .<String>into(FixedWindows.of(Duration.standardSeconds(1)));

p.apply("ReadTopic", PubsubIO.Read.subscription(o.getSubscription()))
        .apply("FixedWindow", w)
        .apply("CountWords", Count.<String>perElement())
        .apply("CreateRows", ParDo.of(new WordCountToRowFn()))
        .apply("WriteRows", tablePipe);
p.run();

The above subscriber will not work, since the window does not seem to trigger using the default trigger. However, if I manually define a trigger, the code works and the counts are written to BigQuery.

Window.Bound<String> w = Window.<String>into(FixedWindows.of(Duration.standardSeconds(1)))
        .triggering(AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(1)))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes();

I'd like to avoid specifying custom triggers if possible.

Questions:

  1. Why does my solution not work with Dataflow's default trigger?
  2. How do I need to change the publisher or the subscriber so that it works with the default trigger?

Recommended answer

How are you determining the trigger never fires?

Your PubsubIO.Write and PubsubIO.Read transforms should both specify a timestamp label using withTimestampLabel, otherwise the timestamps you've added will not be written to PubSub and the publish times will be used.
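
For illustration, here is a sketch of how both pipelines could carry the element timestamps through Pub/Sub. It assumes the Dataflow 1.x-style API used in the question, where the builder method is written here as timestampLabel (newer Apache Beam releases call the equivalent withTimestampAttribute), and an arbitrary attribute name "ts":

// Publisher: store each element's timestamp in the "ts" attribute of the message.
p.apply("ReadLines", TextIO.Read.from(o.getInputFile()))
        .apply("Lines2Words", ParDo.of(new ExtractWordsFn()))
        .apply("AddTimestampFn", ParDo.of(new AddCurrentTimestampFn()))
        .apply("WriteTopic", PubsubIO.Write.topic(o.getTopic())
                .timestampLabel("ts"));

// Subscriber: read the timestamps back from the same attribute, so the
// watermark is derived from them rather than from the publish times.
p.apply("ReadTopic", PubsubIO.Read.subscription(o.getSubscription())
                .timestampLabel("ts"))
        .apply("FixedWindow", w)
        .apply("CountWords", Count.<String>perElement())
        .apply("CreateRows", ParDo.of(new WordCountToRowFn()))
        .apply("WriteRows", tablePipe);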

Either way, the input watermark of the pipeline will be derived from the timestamps of the elements waiting in PubSub. Once all inputs have been processed, it will stay back for a few minutes (in case there was a delay in the publisher) before advancing to real time.

What you are likely seeing is that all the elements are published in the same ~1 second window (since the input file is pretty small). These are all read and processed relatively quickly, but the 1-second window they are put in will not trigger until after the input watermark has advanced, indicating that all data in that 1-second window has been consumed.

This won't happen for several minutes, which may make it look like the trigger isn't working. The trigger you wrote fires after 1 second of processing time, so it fires much earlier, but there is no guarantee that all the data has been processed by then.
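
For comparison, spelling out what the default trigger does (a rough sketch of its equivalent explicit configuration) shows why nothing can fire before the watermark passes the end of the 1-second window:

Window.Bound<String> w = Window
        .<String>into(FixedWindows.of(Duration.standardSeconds(1)))
        // Fires once the watermark passes the end of each 1-second window;
        // while the watermark is held back, no panes are emitted.
        .triggering(AfterWatermark.pastEndOfWindow())
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes();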

Steps to get better behavior from the default trigger:

  1. Use withTimestampLabel in both the Pub/Sub write and read steps.
  2. Have the publisher spread the timestamps out further (for example, run it for a few minutes and spread the timestamps across that range); see the sketch after this list.
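
As a sketch of step 2, the publisher's timestamp function could spread elements over a range instead of stamping them all with the same instant. The two-minute span and the random backward shift below are only illustrative assumptions:

public static class AddSpreadTimestampFn extends DoFn<String, String> {
    private static final java.util.Random RANDOM = new java.util.Random();

    @ProcessElement public void processElement(ProcessContext c) {
        // Shift each timestamp back by up to two minutes so the elements
        // land in many different 1-second windows instead of a single one.
        long shiftMillis = RANDOM.nextInt(2 * 60 * 1000);
        c.outputWithTimestamp(c.element(),
                new Instant(System.currentTimeMillis() - shiftMillis));
    }
}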
