Kafka Streams rebalancing latency spikes on high throughput kafka-streams services


Question

We are starting to work with Kafka Streams; our service is a very simple stateless consumer.

We have tight requirements on latency, and we are facing latency that is too high when the consumer group is rebalancing. In our scenario, rebalancing will happen relatively often: rolling updates of the code, scaling the service up/down, containers being shuffled by the cluster scheduler, containers dying, hardware failing.

One of the first tests we did was having a small consumer group with 4 consumers handling a small amount of messages (1K/sec) and killing one of them; the cluster manager (currently AWS ECS, probably soon moving to K8s) starts a new one, so more than one rebalance happens.

Our most critical metric is latency, which we measure as the milliseconds between message creation in the publisher and message consumption in the subscriber. We saw the maximum latency spike from a few milliseconds to almost 15 seconds.
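As an illustration of that metric, a minimal tracker of maximum end-to-end latency might look like the sketch below. The class and method names are our own, and it assumes the publisher stamps each message with its creation time (e.g. Kafka's default CreateTime record timestamps):

```java
public class MaxLatencyTracker {
    private long maxLatencyMs = 0;

    // Record one observation: consumption wall-clock time minus the
    // timestamp the publisher attached at message creation.
    public synchronized void record(long createdAtMs, long consumedAtMs) {
        long latencyMs = consumedAtMs - createdAtMs;
        if (latencyMs > maxLatencyMs) {
            maxLatencyMs = latencyMs;
        }
    }

    public synchronized long maxLatencyMs() {
        return maxLatencyMs;
    }

    public static void main(String[] args) {
        MaxLatencyTracker tracker = new MaxLatencyTracker();
        tracker.record(1_000, 1_005);   // 5 ms under normal operation
        tracker.record(2_000, 17_000);  // ~15 s spike during a rebalance
        System.out.println(tracker.maxLatencyMs()); // prints 15000
    }
}
```

A tracker like this makes the rebalance spikes visible as a jump in the reported maximum, which is exactly the symptom described above.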

We have also done tests with some rolling updates of the code, and the results are worse, since our deployment is not prepared for Kafka services and we trigger a lot of rebalances. We'll need to work on that, but we are wondering what strategies other people follow for code deployment / autoscaling with the minimum possible delays.

Not sure whether it helps, but our requirements on message processing are pretty relaxed: we don't care about some messages being processed twice from time to time, and we are not very strict about message ordering.

We are using all default configurations, with no tuning.

We need to improve these latency spikes during rebalancing. Can someone please give us some hints on how to work on it? Is tuning configurations enough? Do we need to use a specific PartitionAssignor, or implement our own?

What is the recommended approach to code deployment / autoscaling with the minimum possible delays?

Our Kafka version is 1.1.0; after looking at the libs (finding, for example, kafka/kafka_2.11-1.1.0-cp1.jar), we see we installed Confluent Platform 4.1.0. On the consumer side, we are using kafka-streams 2.1.0.

Thank you for reading my question, and for your responses.

Answer

If the gap is introduced mainly by the rebalance itself, it may be better not to trigger a rebalance at all: just let AWS / K8s do their work, resume the bounced instance, and pay the cost of the unavailability period during the bounce. Note that for stateless instances this is usually better, while for stateful applications you should make sure the restarted instance can access its associated storage, so that it can save on bootstrapping from the changelog.

To do so:

In Kafka 1.1, to reduce unnecessary rebalances you can increase the session timeout of the group, so that the coordinator becomes "less sensitive" to members not responding with heartbeats. Note that we disabled the leave.group request since 0.11.0 for Streams' consumers (https://issues.apache.org/jira/browse/KAFKA-4881), so with a longer session timeout a member leaving the group will not trigger a rebalance, though a member rejoining will still trigger one. Still, one fewer rebalance is better than none.
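For instance, these consumer-level settings can be passed straight through a Kafka Streams `Properties` object. A minimal sketch, where the application id, broker address, and the 30-second value are illustrative assumptions rather than tuned recommendations:

```java
import java.util.Properties;

public class RebalanceTuning {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put("application.id", "latency-sensitive-app"); // placeholder
        props.put("bootstrap.servers", "broker1:9092");       // placeholder
        // Raise the session timeout so the coordinator waits longer for a
        // bounced member before evicting it and triggering a rebalance
        // (the broker-side default is 10000 ms; 30000 ms is illustrative).
        props.put("session.timeout.ms", "30000");
        // Keep heartbeats well below the session timeout (rule of thumb: ~1/3).
        props.put("heartbeat.interval.ms", "10000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsProps().getProperty("session.timeout.ms")); // prints 30000
    }
}
```

The trade-off is that a genuinely dead instance is also detected more slowly, so its partitions sit unprocessed for up to the full session timeout before a rebalance reassigns them.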

In the coming Kafka 2.2, though, we've made a big improvement on optimizing rebalance scenarios, primarily captured in KIP-345 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-345%3A+Introduce+static+membership+protocol+to+reduce+consumer+rebalances). With that, far fewer rebalances will be triggered by a rolling bounce, given the reasonable config settings introduced in KIP-345. So I'd strongly recommend you upgrade to 2.2 and see if it helps your case.
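A hedged sketch of how static membership is enabled once KIP-345 is available: the config is `group.instance.id`, it requires new-enough brokers and clients (the feature ultimately shipped in the 2.3 line), and the id scheme and timeout value below are illustrative assumptions:

```java
import java.util.Properties;

public class StaticMembershipConfig {
    // instanceId must be unique per instance and stable across restarts,
    // e.g. derived from a K8s StatefulSet pod ordinal (our assumption here).
    public static Properties forInstance(String instanceId) {
        Properties props = new Properties();
        props.put("application.id", "latency-sensitive-app"); // placeholder
        props.put("bootstrap.servers", "broker1:9092");       // placeholder
        // KIP-345 static membership: a member that rejoins with the same
        // group.instance.id within the session timeout keeps its partition
        // assignment and does not trigger a rebalance.
        props.put("group.instance.id", instanceId);
        props.put("session.timeout.ms", "60000"); // restart window, illustrative
        return props;
    }

    public static void main(String[] args) {
        Properties p = StaticMembershipConfig.forInstance("consumer-pod-0");
        System.out.println(p.getProperty("group.instance.id")); // prints consumer-pod-0
    }
}
```

With a stable id per container, a rolling bounce that restarts each instance within the session timeout avoids rebalancing entirely, which directly targets the deployment-driven latency spikes described in the question.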

