Kafka Streams on Kubernetes: Long rebalancing after redeployment

Problem Description

We use a StatefulSet to deploy a Scala Kafka Streams application on Kubernetes. The instances have separate applicationIds, so they each replicate the complete input topic for fault tolerance. They are essentially read-only services that only read a state topic and write it to a state store, from where customer requests are served via REST. That means the consumer group always consists of only a single Kafka Streams instance at any given time.
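
For readers unfamiliar with the pattern, here is a minimal sketch of such a read-only instance in Scala against the Kafka Streams Java API. It is not the poster's actual code; the topic, store, and POD_NAME environment-variable names are purely illustrative:

import java.util.Properties

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams.kstream.Materialized
import org.apache.kafka.streams.state.KeyValueStore
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object ReadOnlyStateService extends App {
  val props = new Properties()
  // Each pod uses its own applicationId (and therefore its own consumer group),
  // so every instance replicates the full state topic.
  props.put(StreamsConfig.APPLICATION_ID_CONFIG,
    s"state-service-${sys.env.getOrElse("POD_NAME", "0")}")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092")

  val builder = new StreamsBuilder()
  // Materialize the state topic into a local, queryable key-value store.
  builder.table[String, String](
    "state-topic",
    Materialized
      .as[String, String, KeyValueStore[Bytes, Array[Byte]]]("state-store")
      .withKeySerde(Serdes.String())
      .withValueSerde(Serdes.String())
  )

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  // A REST layer would serve customer lookups from "state-store" via interactive
  // queries (streams.store(...)) once the instance reaches the RUNNING state.
}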

Our problem is now that, when triggering a rolling restart, each instance takes about 5 minutes to start up, with most of that time spent waiting in the REBALANCING state. I've read here that Kafka Streams does not send a LeaveGroup request, so that it can come back quickly after a container restart without a rebalance. Why does this not work for us, and why does the rebalancing take so long even though the applicationId is identical? Ideally, to minimize downtime, the application should pick up immediately where it left off when it was restarted.

Here are some configs we changed from the default values:

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.streams.StreamsConfig

properties.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), "1000")
properties.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), "300000") // 5 minutes
properties.put(StreamsConfig.consumerPrefix(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), "earliest")
// RocksDB config, see https://docs.confluent.io/current/streams/developer-guide/memory-mgmt.html
properties.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, classOf[BoundedMemoryRocksDBConfig])

Questions / Related configs

  • Would it help to decrease session.timeout.ms? We set it to quite a large value as the Kafka brokers live in a different data center and network connections are at times not super reliable.
  • This answer suggests decreasing max.poll.interval.ms, as it is tied to a rebalance timeout. Is that correct? I'm hesitant to change this, as it might have consequences for the normal operation mode of our app.
  • There is mention of a config group.initial.rebalance.delay.ms to delay rebalancing during a deployment - but that would also cause delays after recovery from a crash, wouldn't it?
  • I also stumbled upon KIP-345, which aims to eliminate consumer rebalancing for static memberships entirely via group.instance.id. That would be a good fit for our use case, but it does not seem to be available on our brokers yet.

I'm confused by the multitude of configs and how to use them to enable fast recovery after an update. Can someone explain how they play together?

Recommended Answer

The other question you cite does not say that a rebalance is avoided on restart. Not sending a LeaveGroupRequest only avoids a rebalance when you stop the app. Hence, the number of rebalances is reduced from two to one. Of course, with your somewhat unusual single-instance deployment, you don't gain anything here (in fact, it might actually "hurt" you...).

    Would it help to decrease session.timeout.ms? We set it to quite a large value as the Kafka brokers live in a different data center and network connections are at times not super reliable.

Could be, depending on how quickly you restart the app. (More details below.) Maybe just try it out (i.e., set it to 3 minutes to still have a high value for stability) and see if the rebalance time drops to about 3 minutes?
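
As a concrete starting point, a hedged sketch of that experiment in the same config style as the question; the 180000 ms value is just the 3-minute example from above, not a recommendation:

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.streams.StreamsConfig

// Experiment: lower the consumer session timeout from 5 minutes to 3 minutes and
// check whether the time spent in REBALANCING after a rolling restart drops accordingly.
properties.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), "180000")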

    This answer suggests decreasing max.poll.interval.ms, as it is tied to a rebalance timeout. Is that correct? I'm hesitant to change this, as it might have consequences for the normal operation mode of our app.

max.poll.interval.ms also affects the rebalance time (more details below). However, the default value is 30 seconds and thus should not result in a 5-minute rebalance time.

    There is mention of a config group.initial.rebalance.delay.ms to delay rebalancing during a deployment - but that would also cause delays after recovery from a crash, wouldn't it?

This only applies to empty consumer groups, and the default value is just 3 seconds. So it should not affect you.

    I also stumbled upon KIP-345, which aims to eliminate consumer rebalancing for static memberships entirely via group.instance.id. That would be a good fit for our use case, but it does not seem to be available on our brokers yet.

Using static group membership might actually be the best bet. Maybe it's worth upgrading your brokers to get this feature.
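
Assuming brokers and clients end up on Kafka 2.3 or newer, a hedged sketch of what enabling static membership could look like, again in the question's config style (the POD_NAME environment variable and fallback id are illustrative; with a StatefulSet, the stable pod name is a natural choice for a unique, stable id):

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.streams.StreamsConfig

// Static membership (KIP-345): a stable, unique group.instance.id lets the broker
// re-identify a restarted instance instead of treating it as a brand-new member.
// Keep session.timeout.ms large enough to cover the expected restart window.
properties.put(
  StreamsConfig.consumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG),
  sys.env.getOrElse("POD_NAME", "state-service-0")
)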

Btw, the difference between session.timeout.ms and max.poll.interval.ms is explained in another question: Difference between session.timeout.ms and max.poll.interval.ms for Kafka 0.10.0.0 and later versions

In general, the broker-side group coordinator maintains a list of all members per "group generation". A rebalance is triggered if a member leaves the group actively (by sending a LeaveGroupRequest), times out (via session.timeout.ms or max.poll.interval.ms), or a new member joins the group. If a rebalance happens, each member gets a chance to rejoin the group to be included in the next generation.

For your case, the group has only one member. When you stop your app, no LeaveGroupRequest is sent, and thus the group coordinator removes this member only after session.timeout.ms has passed.

If you restart the app, it comes back as a "new" member (from the group coordinator's point of view). This triggers a rebalance, giving all members of the group a chance to re-join it. For your case, the "old" instance might still be in the group, so the rebalance would only move forward after the group coordinator removed the old member from the group. The issue might be that the group coordinator thinks the group scales out from one to two members... (This is what I meant above: if a LeaveGroupRequest were sent, the group would become empty when you stop your app, and on restart only the new member would be in the group, so the rebalance would move forward immediately.)

Using static group membership would avoid this issue, because on restart the instance could be re-identified as the "old" instance, and the group coordinator would not need to wait for the old group member to expire.
