Kafka-streams delay to kick rebalancing on consumer graceful shutdown


Question

This is a follow-up to a previous question I posted about high latency in our Kafka Streams services (Kafka Streams rebalancing latency spikes on high throughput kafka-streams services).

As a quick reminder, our stateless service has very tight latency requirements, and we are seeing latency that is too high (some messages are consumed more than 10 seconds after being produced), especially when a consumer leaves the group gracefully.

After further investigation, we found that, at least for small consumer groups, the rebalance itself takes less than 500 ms. So where does the large delay (more than 10 seconds) when removing one consumer come from?

We realized that it is the time between the consumer exiting gracefully and the rebalance kicking in.

The previous tests were run with all-default configurations in both Kafka and the Kafka Streams application. We then changed the configuration to:

properties.put("max.poll.records", 50); // defaults to 1000 in Kafka Streams
properties.put("auto.offset.reset", "latest"); // plain consumer default is latest; Kafka Streams defaults to earliest
properties.put("heartbeat.interval.ms", 1000);
properties.put("session.timeout.ms", 6000);
properties.put("group.initial.rebalance.delay.ms", 0); // note: this is a broker-side setting
properties.put("max.poll.interval.ms", 6000);

As a result, the time for the rebalance to start dropped to a bit more than 5 seconds.

We also tested killing a consumer non-gracefully with 'kill -9'; the time to trigger the rebalance was exactly the same.
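
Both measurements fit a simple model in which the group coordinator only notices the departure once session.timeout.ms elapses without a heartbeat. A rough sketch of that arithmetic, using the values from the configuration above (the class and method names are illustrative, and network and processing delays are ignored):

```java
public final class RebalanceDelayEstimate {

    /**
     * Shortest wait: the consumer exits just before its next heartbeat is due,
     * so almost one heartbeat interval of the session has already been used up.
     */
    public static long minDelayMs(long sessionTimeoutMs, long heartbeatIntervalMs) {
        return sessionTimeoutMs - heartbeatIntervalMs;
    }

    /** Longest wait: the consumer exits right after sending a heartbeat. */
    public static long maxDelayMs(long sessionTimeoutMs) {
        return sessionTimeoutMs;
    }

    public static void main(String[] args) {
        long sessionTimeoutMs = 6000;    // session.timeout.ms from the question
        long heartbeatIntervalMs = 1000; // heartbeat.interval.ms from the question
        System.out.println(minDelayMs(sessionTimeoutMs, heartbeatIntervalMs)
                + " to " + maxDelayMs(sessionTimeoutMs) + " ms");
    }
}
```

Under this model the wait falls between 5000 and 6000 ms, in line with the "a bit more than 5 seconds" observed for both the graceful exit and 'kill -9'.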

So we have some questions:

- We expected the rebalance to be triggered right away when a consumer stops gracefully. Is that the expected behavior? Why isn't it happening in our tests?
- How can we reduce the time between a consumer exiting gracefully and the rebalance being triggered? What are the trade-offs? More unneeded rebalances?

For more context, our Kafka version is 1.1.0 (judging by the libs we found, for example kafka/kafka_2.11-1.1.0-cp1.jar); we installed Confluent Platform 4.1.0. On the consumer side, we are using Kafka Streams 2.1.0.

Thanks!

Recommended answer

Kafka Streams does not send a "leave group request" when an instance is shut down gracefully; this is on purpose. The goal is to avoid expensive rebalances when an instance is bounced (e.g., when an application is upgraded, or when it runs in a Kubernetes environment and a pod is quickly restarted automatically).

To achieve this, a non-public configuration is used. You can override it via

props.put("internal.leave.group.on.close", true); // Streams' default is `false`
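
For context, here is a minimal sketch of where that override might sit in an application's configuration. It uses only java.util.Properties so that it runs without the Kafka libraries. The class name, application id, and bootstrap address are placeholders, not values from the question. Because the setting is internal, it may change or disappear between Kafka versions.

```java
import java.util.Properties;

public final class LeaveGroupOnCloseConfig {

    // Internal, non-public Kafka Streams setting named in the answer above.
    static final String LEAVE_GROUP_ON_CLOSE = "internal.leave.group.on.close";

    /** Builds Streams properties that ask the consumer to leave the group on close. */
    public static Properties build(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        props.put("application.id", applicationId);       // placeholder value supplied by caller
        props.put("bootstrap.servers", bootstrapServers); // placeholder value supplied by caller
        props.put(LEAVE_GROUP_ON_CLOSE, true);            // Streams' default is false
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("latency-sensitive-app", "localhost:9092");
        System.out.println(props.get(LEAVE_GROUP_ON_CLOSE));
    }
}
```

In a real application these properties would be passed to new KafkaStreams(topology, props), and the leave group request would then be expected when close() is called. The trade-off is the one the answer describes: with the override enabled, every bounce of an instance (an upgrade, a quick pod restart) triggers a rebalance that the default is designed to avoid.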
