KSQL Table-Table left outer join emits the same join result more than once

Problem description

Using KSQL to perform a left outer join, I can see that the result of my join is sometimes emitted more than once.

In other words, the same join result is emitted more than once. I am not talking about one version of the join with a null value on the right side and another version without the null value. Literally the same record that results from a join is emitted more than once.

I wonder whether this is expected behaviour.

Recommended answer

The general answer is yes. Kafka is an at-least-once system. More specifically, a few scenarios can result in duplication:

  1. Consumers only periodically checkpoint their positions. A consumer crash can result in duplicate processing of some range of records.
  2. Producers have client-side timeouts. This means the producer may think a request timed out and re-transmit it, while on the broker side it actually succeeded.
  3. If you mirror data between Kafka clusters, that is usually done with a producer + consumer pair of some sort, which can lead to more duplication.
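
To make the first scenario concrete, here is a minimal sketch in plain Python. It does not use Kafka at all: the `consume` function, the in-memory `log`, and the `commit_every` and `crash_at` parameters are hypothetical stand-ins for a consumer that checkpoints its position only periodically.

```python
def consume(log, committed_offset, commit_every, crash_at=None):
    """Process records starting at committed_offset.

    The position is checkpointed only every `commit_every` records.
    Returns (records processed in this run, last committed offset).
    """
    processed = []
    offset = committed_offset
    while offset < len(log):
        if crash_at is not None and offset == crash_at:
            # Simulated crash: anything processed since the last
            # checkpoint is forgotten, so only the old offset survives.
            return processed, committed_offset
        processed.append(log[offset])
        offset += 1
        if offset % commit_every == 0:
            committed_offset = offset  # periodic checkpoint
    return processed, offset  # clean finish: final position is committed


log = ["a", "b", "c", "d", "e"]

# First run crashes after processing "c" but before checkpointing it.
first_run, committed = consume(log, 0, commit_every=2, crash_at=3)

# The restarted consumer resumes from the last checkpoint and sees "c" again.
second_run, committed = consume(log, committed, commit_every=2)

print(first_run + second_run)
```

The record processed after the last checkpoint and before the crash is handled twice, which is the kind of duplicate a downstream reader of the join output would observe.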

Are you seeing any such crashes or timeouts in your logs?

There are a few Kafka features you could try to reduce the likelihood of this happening:

  1. Set enable.idempotence to true in your producer configs (see https://kafka.apache.org/documentation/#producerconfigs) - incurs some overhead.
  2. Use transactions when producing - incurs overhead and adds latency.
  3. Set transactional.id on the producer in case you fail over across machines - gets complicated to manage at scale.
  4. Set isolation.level to read_committed on the consumer - adds latency (needs to be done in combination with item 2 above).
  5. Shorten auto.commit.interval.ms on the consumer - this only reduces the window of duplication and does not really solve anything. It incurs overhead at really low values.
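
As a rough illustration, the settings above could be collected into client configuration dictionaries like the following. The config keys are the ones named in the list; the broker address, transactional id, group id, and the `reads_only_committed` helper are hypothetical placeholders, and how the dictionaries are passed to a client depends on the Kafka library you use.

```python
# Illustrative values only: the broker address, transactional id, and
# group id below are assumptions, not taken from the original question.
producer_config = {
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,           # item 1
    "transactional.id": "join-writer-1",  # items 2 and 3
}

consumer_config = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "join-reader",
    "isolation.level": "read_committed",  # item 4
    "auto.commit.interval.ms": 1000,      # item 5: shorter than the 5000 ms default
}


def reads_only_committed(producer, consumer):
    """Hypothetical sanity check: item 4 only helps when paired with item 2."""
    is_transactional = bool(producer.get("transactional.id"))
    skips_uncommitted = consumer.get("isolation.level") == "read_committed"
    return is_transactional and skips_uncommitted


print(reads_only_committed(producer_config, consumer_config))
```

The helper only encodes the dependency stated in item 4: read_committed on the consumer has an effect on duplicates only if the producer actually writes inside transactions.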
