Kafka Streams: delayed rebalance after a consumer shuts down gracefully

Problem description

This is a follow-up to a previous question I posted about high latency in our Kafka Streams application (Kafka Streams rebalancing latency spikes on high throughput kafka-streams services).

As a quick reminder, our stateless service has very tight latency requirements, and we are seeing latency that is far too high (some messages are consumed more than 10 seconds after being produced), especially when a consumer leaves the group gracefully.

After further investigation we found that, at least for small consumer groups, the rebalance itself takes less than 500 ms. So we asked ourselves: where does this huge latency (>10 s) come from when one consumer is removed?

We realized that it is the time between the consumer exiting gracefully and the rebalance kicking in.

The previous tests were executed with all-default configurations in both Kafka and the Kafka Streams application. We changed the configuration to:

properties.put("max.poll.records", 50); // defaults to 1000 in Kafka Streams
properties.put("auto.offset.reset", "latest"); // plain consumer default is latest; Kafka Streams defaults to earliest
properties.put("heartbeat.interval.ms", 1000);
properties.put("session.timeout.ms", 6000);
properties.put("group.initial.rebalance.delay.ms", 0); // note: this is a broker-side setting
properties.put("max.poll.interval.ms", 6000);

As a result, the time until the rebalance starts dropped to a bit more than 5 seconds.

We also tested killing a consumer non-gracefully with 'kill -9'; the time to trigger the rebalance was exactly the same.

So we have some questions:
- We expected the rebalance to be triggered right away when a consumer stops gracefully. Is that the expected behavior? Why isn't it happening in our tests?
- How can we reduce the time between a consumer gracefully exiting and the rebalance being triggered? What are the trade-offs? More unneeded rebalances?

For more context: our Kafka version is 1.1.0, judging by the libraries we found (for example kafka/kafka_2.11-1.1.0-cp1.jar); we installed Confluent Platform 4.1.0. On the consumer side, we are using Kafka Streams 2.1.0.

Thanks!

Recommended answer

Kafka Streams does not send a "leave group request" when an instance is shut down gracefully; this is on purpose. The goal is to avoid expensive rebalances when an instance is bounced (e.g., when an application is upgraded, or when it runs in a Kubernetes environment and a pod is quickly restarted automatically).
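
This would also explain why a graceful stop and 'kill -9' behave the same in the question: without a leave group request, the group coordinator presumably only notices the departure once the member's session times out. A rough sketch of that timing, under the simplifying assumption that the last heartbeat arrived at most one heartbeat.interval.ms before shutdown and that network delays are negligible (the class and method names are made up for illustration):

```java
public class RebalanceDelayEstimate {

    /**
     * Simplified model (an assumption, not taken from the Kafka source):
     * the session expires session.timeout.ms after the last heartbeat,
     * and the last heartbeat was sent between 0 and heartbeat.interval.ms
     * before the instance stopped.
     *
     * @return {earliest, latest} delay in ms between shutdown and session expiry
     */
    static long[] detectionWindowMs(long sessionTimeoutMs, long heartbeatIntervalMs) {
        if (heartbeatIntervalMs <= 0 || heartbeatIntervalMs >= sessionTimeoutMs) {
            throw new IllegalArgumentException(
                "expected 0 < heartbeat.interval.ms < session.timeout.ms");
        }
        return new long[] { sessionTimeoutMs - heartbeatIntervalMs, sessionTimeoutMs };
    }

    public static void main(String[] args) {
        // Values from the question: session.timeout.ms=6000, heartbeat.interval.ms=1000
        long[] tuned = detectionWindowMs(6000, 1000);
        System.out.println("tuned: " + tuned[0] + ".." + tuned[1] + " ms");
    }
}
```

With the values from the question this gives a window of 5000 to 6000 ms, which is at least consistent with the "a bit more than 5 secs" that was observed.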

To achieve this, a non-public configuration is used. You can override the config via:

props.put("internal.leave.group.on.close", true); // Streams' default is `false`
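
For context, here is a minimal sketch of where that line might sit when the Streams properties are assembled. The application id and bootstrap servers are placeholder values, not taken from the question, and the class name is hypothetical; plain string keys are used so the snippet compiles with the JDK alone.

```java
import java.util.Properties;

public class LeaveGroupOnCloseConfig {

    /** Builds the properties that would be passed to the KafkaStreams constructor. */
    static Properties buildProps() {
        Properties props = new Properties();
        // Placeholder values (assumptions for illustration only).
        props.put("application.id", "latency-sensitive-app");
        props.put("bootstrap.servers", "localhost:9092");
        // Non-public config from the answer; Streams' default is false.
        props.put("internal.leave.group.on.close", true);
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps();
        System.out.println(props.get("internal.leave.group.on.close"));
    }
}
```

The trade-off is the one described above: with this enabled, every bounce (an upgrade, a quick pod restart) triggers a rebalance immediately. Because the setting is internal, it may also change or disappear between versions.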
