光束中实际管理水印的是什么? [英] what actually manages watermarks in beam?

查看:84
本文介绍了光束中实际管理水印的是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Beam的强大功能来自其先进的开窗功能,但也有些令人困惑.

Beam's big power comes from it's advanced windowing capabilities, but it's also a bit confusing.

在本地测试(我使用rabbitmq作为输入源)中看到了一些奇怪的地方(消息并不总是得到ack d,并且固定的窗口并不总是关闭的),我开始研究StackOverflow和Beam代码库.

Having seen some oddities in local tests (I use rabbitmq for an input Source) where messages were not always getting ackd, and fixed windows that were not always closing, I started digging around StackOverflow and the Beam code base.

在设置确切的水印时,似乎有特定于源的问题:

It seems there are Source-specific concerns with when exactly watermarks are set:

  • RabbitMQ watermark does not advance: Apache Beam : RabbitMqIO watermark doesn't advance
  • PubSub watermark does not advance for low volumes: https://issues.apache.org/jira/browse/BEAM-7322
  • SQS IO does not advance the watermark over a period of time of no new incoming messages - https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/sqs/SqsIO.java#L44

(和其他).此外,似乎有与Watermarks相反的Checkpoint s(CheckpointMark s)独立概念.

(and others). Further, there seem to be independent notions of Checkpoints (CheckpointMarks) as oppose to Watermarks.

所以我想这是一个多部分的问题:

So I suppose this is a multi-part question:

  1. 什么代码负责移动水印?它似乎是Source和Runner的某种组合,但我似乎无法真正查找来更好地理解它(或针对我们的用例进行调整).对我来说,这是一个特别的问题,因为在音量低的时候,水印永远不会前进,消息也不会ack d
  2. 我没有太多关于Checkpoint/Checkpoint标记在概念上的文档(非代码Beam文档未对此进行讨论). CheckpointMark与水印如何相互作用(如果有的话)?
  1. What code is responsible for moving the watermark? It seems to be some combination of the Source and the Runner, but I can't seem to actually find it to understand it better (or tweak it for our use cases). It is a particular issue for me as in periods of low volume the watermark never advances and messages are not ackd
  2. I don't see much documentation around what a Checkpoint/Checkpoint mark is conceptually (the non-code Beam documentation doesn't discuss it). How does a CheckpointMark interact with a Watermark, if at all?

推荐答案

  1. 每个PCollection都有自己的水印.水印指示该特定PCollection 的完整性.源负责它产生的PCollection的水印.水印向下游PCollections的传播是自动的,没有其他近似方法;它可以大致理解为输入PCollections和缓冲状态的最小值".因此,根据您的情况,查找水印问题为RabbitMqIO.我对这个特定的IO连接器不熟悉,但是如果您尚未执行此操作,则可以提交错误报告或将电子邮件发送给用户列表.
  2. 检查点是特定于源的数据,只要该检查点在跑步者的持久坚持下,它就可以继续读取而不会丢失任何消息.消息ACK倾向于在检查点终结中发生,因为运行程序在知道永远不需要重新读取消息时会调用此方法.
  1. Each PCollection has its own watermark. The watermark indicates how complete that particular PCollection is. The source is responsible for the watermark of the PCollection that it produces. The propagation of watermarks to downstream PCollections is automatic with no additional approximation; it can be roughly understood as "the minimum of input PCollections and buffered state". So in your case, it is RabbitMqIO to look at for watermark problems. I am not familiar with this particular IO connector, but a bug report or email to the user list would be good if you have not already done this.
  2. A checkpoint is a source-specific piece of data that allows it to resume reading without missed messages, as long as the checkpoint is durably persisted by the runner. Message ACK tends to happen in checkpoint finalization, since the runner calls this method when it is known that the message never needs to be re-read.

这篇关于光束中实际管理水印的是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆