KSQL Table-Table Left Outer Join Emits the Same Join Result More Than Once
Problem description
Using KSQL and performing a left outer join, I can see the result of my join sometimes emitted more than once.
In other words, the same join result is emitted more than once. I am not talking about a version of the join with a null value on the right side and a version without the null value. Literally the same record resulting from a join is emitted more than once.
I wonder if that is expected behaviour.
Recommended answer
The general answer is yes. Kafka is an at-least-once system. More specifically, a few scenarios can result in duplication:
- Consumers only periodically checkpoint their positions. A consumer crash can result in duplicate processing of some range of records.
- Producers have client-side timeouts. This means the producer may think a request timed out and re-transmit it, while on the broker side it actually succeeded.
- If you mirror data between Kafka clusters, that's usually done with a producer + consumer pair of some sort, which can lead to more duplication.
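The first scenario above can be sketched as a toy simulation. This is not Kafka client code; all names are illustrative. It shows why periodic offset commits create a window of records that get reprocessed after a crash:

```python
# Toy simulation of at-least-once consumption: positions are committed
# only periodically, so a crash after processing but before the next
# commit causes records in that window to be processed twice.
records = list(range(10))   # offsets 0..9 in one partition
commit_interval = 4         # checkpoint the position every 4 records
processed = []
last_committed = 0

# First run: process offsets 0..5, then "crash" before the next commit.
for offset in records:
    processed.append(offset)
    if (offset + 1) % commit_interval == 0:
        last_committed = offset + 1  # periodic checkpoint
    if offset == 5:
        break                        # simulated crash mid-interval

# Restart: consumption resumes from the last committed offset (4),
# so offsets 4 and 5 are processed a second time -> duplicates.
for offset in records[last_committed:]:
    processed.append(offset)

duplicates = sorted(o for o in set(processed) if processed.count(o) > 1)
print(duplicates)  # -> [4, 5]
```

A shorter commit interval shrinks that window but never eliminates it, which is why the answer below calls `auto.commit.interval.ms` a mitigation rather than a fix.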
Are you seeing any such crashes/timeouts in your logs?
There are a few Kafka features you could try using to reduce the likelihood of this happening to you:
- Set `enable.idempotence` to true in your producer configs (see https://kafka.apache.org/documentation/#producerconfigs) - incurs some overhead.
- Use transactions when producing - incurs overhead and adds latency.
- Set `transactional.id` on the producer in case you fail over across machines - gets complicated to manage at scale.
- Set `isolation.level` to `read_committed` on the consumer - adds latency (needs to be done in combination with transactions above).
- Shorten `auto.commit.interval.ms` on the consumer - just reduces the window of duplication, doesn't really solve anything; incurs overhead at really low values.
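As a sketch, the settings above could be collected like this. The keys are real Kafka configuration names; the values (and the `transactional.id` string) are example choices, and how you pass these maps depends on your client library — plain dicts are shown here, not a client call:

```python
# Producer-side mitigations: idempotence dedupes broker-side on retry;
# a stable transactional.id lets a restarted instance fence its predecessor.
producer_config = {
    "enable.idempotence": "true",         # some overhead
    "transactional.id": "my-app-prod-1",  # example value; keep it stable across fail-over
}

# Consumer-side mitigations: read_committed hides aborted/uncommitted
# transactional data; a shorter commit interval narrows the duplicate window.
consumer_config = {
    "isolation.level": "read_committed",  # adds latency; requires transactional producers
    "auto.commit.interval.ms": "1000",    # smaller window, more commit overhead
}

print(producer_config["enable.idempotence"], consumer_config["isolation.level"])
```

Note that none of this gives KSQL end-to-end exactly-once on its own; these settings only reduce how often the duplication scenarios above can occur.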