Scio: groupByKey doesn't work when using Pub/Sub as collection source


Question


I changed the source of the WindowedWordCount example program from a text file to Cloud Pub/Sub, as shown below. I published the Shakespeare file's data to Pub/Sub and it was fetched properly, but none of the transformations after .groupByKey seem to run.

sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(windowSize) // apply windowing logic
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  .withWindow[IntervalWindow]
  .swap
  .groupByKey
  .map { s =>
    println("\n\n\n\n\n\n\n This never prints \n\n\n\n\n")
    println(s)
  }

Answer


Changing the input from a text file to Pub/Sub makes the PCollection "unbounded". Grouping that by key requires defining an aggregation trigger, otherwise the grouper will wait forever. This is mentioned in the Dataflow documentation here: https://cloud.google.com/dataflow/model/group-by-key


Note: Either non-global Windowing or an aggregation trigger is required in order to perform a GroupByKey on an unbounded PCollection. This is because a bounded GroupByKey must wait for all the data with a certain key to be collected; but with an unbounded collection, the data is unlimited. Windowing and/or Triggers allow grouping to operate on logical, finite bundles of data within the unbounded data stream.


If you apply GroupByKey to an unbounded PCollection without setting either a non-global windowing strategy, a trigger strategy, or both, Dataflow will generate an IllegalStateException error when your pipeline is constructed.
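Per the documentation quoted above, the fix is to attach a trigger (and an accumulation mode and allowed lateness) to the fixed-window strategy. A minimal sketch in Scio, reusing `sc`, `psSubscription`, and `windowSize` from the question; the processing-time trigger and its 5-second delay are illustrative assumptions, not the only valid choice:

```scala
import com.spotify.scio.values.WindowOptions
import org.apache.beam.sdk.transforms.windowing.{AfterProcessingTime, IntervalWindow, Repeatedly}
import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
import org.joda.time.Duration

sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(
    windowSize,
    options = WindowOptions(
      // Fire repeatedly, 5 seconds (illustrative) after the first
      // element arrives in each pane, instead of waiting forever.
      trigger = Repeatedly.forever(
        AfterProcessingTime
          .pastFirstElementInPane()
          .plusDelayOf(Duration.standardSeconds(5))),
      accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
      allowedLateness = Duration.ZERO
    )
  )
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  .withWindow[IntervalWindow]
  .swap
  .groupByKey
  .map(println) // now fires once per triggered pane
```

With `DISCARDING_FIRED_PANES` each firing emits only the elements that arrived since the previous firing; use `ACCUMULATING_FIRED_PANES` if each pane should instead contain everything seen so far in the window.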


Unfortunately, the Python SDK of Apache Beam does not seem to support triggers (yet), so I'm not sure what the solution would be in Python.

(See https://beam.apache.org/documentation/programming-guide/#triggers)
