使用默认触发器在Windows中使用无边界数据 [英] Consuming unbounded data in windows with default trigger

查看：104 发布时间：2020/9/3 4:57:35 google-cloud-dataflow apache-beam

本文介绍了使用默认触发器在Windows中使用无边界数据的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个发布/订阅主题+订阅，并且想使用和汇总无边界订阅中的数据流中的数据.我使用固定的窗口并将聚合写入BigQuery.

I have a Pub/Sub topic + subscription and want to consume and aggregate the unbounded data from the subscription in a Dataflow. I use a fixed window and write the aggregates to BigQuery.

读取和写入(没有窗口和聚合)可以正常工作.但是，当我将数据通过管道传输到固定窗口(以计算每个窗口中的元素)时，该窗口永远不会触发.因此，聚合不被写入.

Reading and writing (without windowing and aggregation) works fine. But when I pipe the data into a fixed window (to count the elements in each window) the window is never triggered. And thus the aggregates are not written.

这是我的文字发布者(它使用示例作为输入文件):

Here is my word publisher (it uses kinglear.txt from the examples as input file):

public static class AddCurrentTimestampFn extends DoFn<String, String> {
    @ProcessElement public void processElement(ProcessContext c) {
        c.outputWithTimestamp(c.element(), new Instant(System.currentTimeMillis()));
    }
}

public static class ExtractWordsFn extends DoFn<String, String> {
    @ProcessElement public void processElement(ProcessContext c) {
        String[] words = c.element().split("[^a-zA-Z']+");
        for (String word:words){ if(!word.isEmpty()){ c.output(word); }}
    }
}

// main:
Pipeline p = Pipeline.create(o); // 'o' are the pipeline options
p.apply("ReadLines", TextIO.Read.from(o.getInputFile()))
        .apply("Lines2Words", ParDo.of(new ExtractWordsFn()))
        .apply("AddTimestampFn", ParDo.of(new AddCurrentTimestampFn()))
        .apply("WriteTopic", PubsubIO.Write.topic(o.getTopic()));
p.run();

这是我的窗口式文字计数器:

Here is my windowed word counter:

Pipeline p = Pipeline.create(o); // 'o' are the pipeline options

BigQueryIO.Write.Bound tablePipe = BigQueryIO.Write.to(o.getTable(o))
        .withSchema(o.getSchema())
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);

Window.Bound<String> w = Window
        .<String>into(FixedWindows.of(Duration.standardSeconds(1)));

p.apply("ReadTopic", PubsubIO.Read.subscription(o.getSubscription()))
        .apply("FixedWindow", w)
        .apply("CountWords", Count.<String>perElement())
        .apply("CreateRows", ParDo.of(new WordCountToRowFn()))
        .apply("WriteRows", tablePipe);
p.run();

上述订阅者将无法使用，因为该窗口似乎无法使用

The above subscriber will not work, since the window does not seem to trigger using the default trigger. However, if I manually define a trigger the code works and the counts are written to BigQuery.

Window.Bound<String> w = Window.<String>into(FixedWindows.of(Duration.standardSeconds(1)))
        .triggering(AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(1)))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes();

我希望避免在可能的情况下指定自定义触发器.

I like to avoid specifying custom triggers if possible.

问题:

为什么我的解决方案不能与Dataflow的默认触发器一起使用?
如何使用默认触发器?

Why does my solution not work with Dataflow's default trigger?
How do I have to change my publisher or subscriber to trigger windows using the default trigger?

使用默认触发器在Windows中使用无边界数据 [英] Consuming unbounded data in windows with default trigger

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

使用默认触发器在Windows中使用无边界数据 [英] Consuming unbounded data in windows with default trigger

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭