Kafka Streams rebalancing latency spikes on high throughput kafka-streams services


Question

We are starting to work with Kafka Streams; our service is a very simple stateless consumer.

We have tight requirements on latency, and we are facing too-high latency whenever the consumer group is rebalancing. In our scenario, rebalancing happens relatively often: rolling code updates, scaling the service up/down, containers being shuffled by the cluster scheduler, containers dying, hardware failing.

One of the first tests we did was to run a small consumer group with 4 consumers handling a small volume of messages (1K/sec) and kill one of them; the cluster manager (currently AWS ECS, probably soon moving to K8s) starts a new one, so more than one rebalance takes place.

Our most critical metric is latency, which we measure as the milliseconds between message creation in the publisher and message consumption in the subscriber. We saw the maximum latency spike from a few milliseconds to almost 15 seconds.

We have also done tests with rolling code updates, and the results are worse, since our deployment is not prepared for Kafka services and we trigger a lot of rebalances. We'll need to work on that, but we're wondering what strategies other people follow for code deployment / autoscaling with the minimum possible delays.

Not sure whether it helps, but our requirements related to message processing are pretty relaxed: we don't care if some messages are processed twice from time to time, nor are we very strict about message ordering.

We are using all default configurations, no tuning.

We need to improve these latency spikes during rebalancing. Can someone please give us some hints on how to work on it? Is tuning configurations enough? Do we need to use a specific partition assignor? Implement our own?

What is the recommended approach to code deployment / autoscaling with the minimum possible delays?

Our Kafka version is 1.1.0; looking at the libs we found, for example, kafka/kafka_2.11-1.1.0-cp1.jar, since we installed Confluent Platform 4.1.0. On the consumer side, we are using Kafka Streams 2.1.0.

Thank you for reading my question, and for your responses.

Answer

If the gap is introduced mainly by the rebalance, one option is to not trigger a rebalance at all: just let AWS / K8s do their work, resume the bounced instance, and pay for the unavailability period during the bounce. Note that for stateless instances this is usually better, while for stateful applications you'd better make sure the restarted instance can access its associated storage, so that it can save on bootstrapping from the changelog.

To do that:

In Kafka 1.1, to reduce unnecessary rebalances you can increase the session timeout of the group, so that the coordinator becomes "less sensitive" to members not responding with heartbeats. Note that we have disabled the leave.group request since 0.11.0 for Streams' consumers (https://issues.apache.org/jira/browse/KAFKA-4881), so with a longer session timeout a member leaving the group will not trigger a rebalance, though a member rejoining will still trigger one. Still, one rebalance fewer is better than none.
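As an illustration, a minimal sketch of what that could look like in a Streams app; the application id, broker address and timeout values below are placeholder assumptions, not recommendations:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stateless-service");  // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // placeholder
// consumerPrefix() forwards the setting to the embedded consumer. A larger
// session timeout lets a bounced instance come back before the coordinator
// evicts it and triggers a rebalance.
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 30000);
// Keep the heartbeat interval well below the session timeout (rule of thumb: ~1/3).
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 10000);

Note the trade-off: a longer session timeout also means genuinely dead instances are detected, and their partitions reassigned, more slowly.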

In the upcoming Kafka 2.2, though, we've made a big improvement in optimizing rebalance scenarios, primarily captured in KIP-345 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-345%3A+Introduce+static+membership+protocol+to+reduce+consumer+rebalances). With the reasonable config settings introduced in KIP-345, far fewer rebalances will be triggered by a rolling bounce. So I'd strongly recommend you upgrade to 2.2 and see if it helps your case.
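For reference, KIP-345's static membership is enabled by giving each instance a stable group.instance.id (the config ultimately shipped in Kafka 2.3, and requires brokers and clients at that version or newer). A hedged sketch, where the id source and timeout value are assumptions:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stateless-service");  // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // placeholder
// The id must be unique per instance but stable across restarts, e.g. derived
// from the pod / container name (the POD_NAME environment variable is a placeholder).
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG),
          System.getenv().getOrDefault("POD_NAME", "instance-0"));
// Pair it with a session timeout longer than a typical bounce, so the coordinator
// keeps the member's assignment while the instance restarts instead of rebalancing.
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 60000);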
