Scio: groupByKey doesn't work when using Pub/Sub as collection source


Question

I changed the source of the WindowsWordCount example program from a text file to Cloud Pub/Sub, as shown below. I published the Shakespeare file's data to Pub/Sub and it was fetched properly, but none of the transformations after .groupByKey seem to work.

sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(windowSize) // apply windowing logic
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  .withWindow[IntervalWindow]
  .swap
  .groupByKey
  .map {
    s =>
      println("\n\n\n\n\n\n\n This never prints \n\n\n\n\n")
      println(s)
  }

Answer

Changing the input from a text file to Pub/Sub makes the PCollection "unbounded". Grouping that by key requires defining aggregation triggers, otherwise the grouper will wait forever. This is mentioned in the Dataflow documentation here: https://cloud.google.com/dataflow/model/group-by-key

Note: Either non-global Windowing or an aggregation trigger is required in order to perform a GroupByKey on an unbounded PCollection. This is because a bounded GroupByKey must wait for all the data with a certain key to be collected; but with an unbounded collection, the data is unlimited. Windowing and/or Triggers allow grouping to operate on logical, finite bundles of data within the unbounded data stream.

If you apply GroupByKey to an unbounded PCollection without setting either a non-global windowing strategy, a trigger strategy, or both, Dataflow will generate an IllegalStateException error when your pipeline is constructed.

Unfortunately, the Python SDK of Apache Beam does not seem to support triggers (yet), so I'm not sure what the solution would be in Python.

(See https://beam.apache.org/documentation/programming-guide/#triggers )
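In Scala/Scio, a trigger can be attached at the point where the window is applied. Below is a minimal sketch of how the question's pipeline could be adapted, assuming Scio's `WindowOptions` parameter on `withFixedWindows` and Beam's standard trigger classes; the 30-second early-firing delay is an illustrative choice, not something prescribed by the answer:

```scala
import com.spotify.scio.values.WindowOptions
import org.apache.beam.sdk.transforms.windowing.{AfterProcessingTime, AfterWatermark}
import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
import org.joda.time.Duration

sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(
    windowSize,
    options = WindowOptions(
      // Fire an early pane 30s after the first element arrives in a pane,
      // then a final pane once the watermark passes the end of the window.
      trigger = AfterWatermark
        .pastEndOfWindow()
        .withEarlyFirings(
          AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(30))),
      accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
      allowedLateness = Duration.ZERO
    )
  )
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  // Downstream grouping now emits per-pane results instead of waiting forever.
```

With a trigger configured, the groupings downstream of the window fire as panes are emitted, so the `map` in the original code should start printing.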
