Kafka Streams application: endless rebalancing

Problem description

We are running a Kafka Streams application and are stuck on a strange problem. We are using both a global state store and multiple other state stores.

Our application has loaded all the data, and the state stores now hold a good amount of information. When we brought the application down and back up again (with some config changes), it went into endless rebalancing. To verify, we reverted the config changes, but it is still stuck in that stage. There are no errors in the logs.

INFO  o.apache.kafka.streams.KafkaStreams - stream-client [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb] Started Streams client
INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] State transition from RUNNING to PARTITIONS_REVOKED
INFO  o.apache.kafka.streams.KafkaStreams - stream-client [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb] State transition from RUNNING to REBALANCING
INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] partition revocation took 1 ms.
    suspended active tasks: []
    suspended standby tasks: []
INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] State transition from RUNNING to PARTITIONS_REVOKED
INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] partition revocation took 0 ms.
    suspended active tasks: []
    suspended standby tasks: []
04:02:13.682 6985 [main] INFO  com..... - Started Application in 6.647 seconds (JVM running for 7.484)
04:02:23.300 16603 [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] State transition from PARTITIONS_REVOKED to PARTITIONS_ASSIGNED
04:02:23.300 16603 [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] State transition from PARTITIONS_REVOKED to PARTITIONS_ASSIGNED
04:02:23.328 16631 [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1] partition assignment took 28 ms.
    current active tasks: [0_0, 1_0, 2_0, 3_0, 4_0, 5_0, 6_0, 7_5, 8_5, 9_5, 10_5, 12_4, 13_4, 14_4, 15_4, 16_4, 17_4, 19_3, 20_3, 21_3, 22_3, 23_3, 24_3, 25_3, 29_0]
    current standby tasks: [0_2]
    previous active tasks: []

04:02:23.328 16631 [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] INFO  o.a.k.s.p.internals.StreamThread - stream-thread [app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2] partition assignment took 28 ms.
    current active tasks: [0_3, 1_3, 2_3, 3_3, 4_3, 5_3, 7_2, 8_2, 9_2, 10_2, 12_1, 13_1, 14_1, 15_1, 16_1, 17_1, 19_0, 20_0, 21_0, 22_0, 23_0, 24_0, 25_0, 26_0]
    current standby tasks: [0_5]
    previous active tasks: []
04:03:47.602 100905 [http-nio-8080-exec-10] INFO  c.j.d.r.b.p.base.StreamsRestService - State of Kafka Streams Application: REBALANCING
04:03:49.356 102659 [http-nio-8080-exec-2] INFO  c.j.d.r.b.p.base.StreamsRestService - State of Kafka Streams Application: REBALANCING
04:03:51.600 104903 [http-nio-8080-exec-3] INFO  c.j.d.r.b.p.base.StreamsRestService - State of Kafka Streams Application: REBALANCING
04:03:53.356 106659 [http-nio-8080-exec-4] INFO  c.j.d.r.b.p.base.StreamsRestService - State of Kafka Streams Application: REBALANCING

Number of topics - 100
Partitions per topic - 6.  (7 topics with 1 partition only)
Kubernetes environment - 3 pods (2 stream threads each)

When we list the members of the consumer group with the following command:

root@bastion-0:/app/confluent-5.2.2/bin# ./kafka-consumer-groups --describe --group app  --bootstrap-server kafka-0..local:9094 --command-config /app/client-sasl-ssl.properties --members

CONSUMER-ID                                                                                               HOST                    CLIENT-ID                                                            #PARTITIONS     
app-b8c729c9-dc1c-457b-8120-457035e84e58-StreamThread-1-consumer-3b370697-e737-411c-af28-fb04cfbae1dd 1.1.1.1/1.1.1.1 app-b8c729c9-dc1c-457b-8120-457035e84e58-StreamThread-1-consumer 45              
app-aaef2f83-d51c-4b6f-bbd8-616db988bd33-StreamThread-2-consumer-3edb3e5f-9f1a-499f-8732-6cd2c8b96c96 2.2.2.2/2.2.2.2 app-aaef2f83-d51c-4b6f-bbd8-616db988bd33-StreamThread-2-consumer 45              
app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1-consumer-00e24df4-5669-4e2c-a775-8f6c4f689714 3.3.3.3/3.3.3.3 app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-1-consumer 46              
app-b8c729c9-dc1c-457b-8120-457035e84e58-StreamThread-2-consumer-1b6b2955-5dfd-4be7-8ad9-9f1b54fe6310 1.1.1.1/1.1.1.1 app-b8c729c9-dc1c-457b-8120-457035e84e58-StreamThread-2-consumer 45              
app-aaef2f83-d51c-4b6f-bbd8-616db988bd33-StreamThread-1-consumer-72cd0319-8ca7-493c-891d-3022b235ea01 2.2.2.2/2.2.2.2 app-aaef2f83-d51c-4b6f-bbd8-616db988bd33-StreamThread-1-consumer 45              
app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2-consumer-c1a16d64-8d49-4758-ab64-2af3cd9aef0f 3.3.3.3/3.3.3.3 app-1f6b14fc-685c-49fb-83c0-54e15bca15cb-StreamThread-2-consumer 45   

The output of the above command keeps changing: the partition counts go from 0 to some variable number. Ideally it should become stable after some time.
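One way to tell whether the assignment has settled is to compare consecutive snapshots of the `--members` output. The following is a minimal sketch, not part of the original setup: the class `MemberSnapshot` and its method are hypothetical helpers, the consumer and client IDs are shortened placeholders, and it assumes the four-column layout shown above (CONSUMER-ID, HOST, CLIENT-ID, #PARTITIONS).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MemberSnapshot {
    // Sums the trailing #PARTITIONS column per HOST for one snapshot of the
    // `kafka-consumer-groups --describe --members` output.
    static Map<String, Integer> partitionsPerHost(List<String> lines) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            // Skip blank lines and the header row.
            if (cols.length < 4 || cols[0].equals("CONSUMER-ID")) {
                continue;
            }
            int partitions;
            try {
                partitions = Integer.parseInt(cols[cols.length - 1]);
            } catch (NumberFormatException e) {
                continue; // not a member row
            }
            totals.merge(cols[1], partitions, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Partition counts taken from the listing above; IDs are shortened placeholders.
        List<String> snapshot = List.of(
            "CONSUMER-ID HOST CLIENT-ID #PARTITIONS",
            "consumer-a 1.1.1.1/1.1.1.1 client-a 45",
            "consumer-b 2.2.2.2/2.2.2.2 client-b 45",
            "consumer-c 3.3.3.3/3.3.3.3 client-c 46",
            "consumer-d 1.1.1.1/1.1.1.1 client-d 45",
            "consumer-e 2.2.2.2/2.2.2.2 client-e 45",
            "consumer-f 3.3.3.3/3.3.3.3 client-f 45");
        System.out.println(partitionsPerHost(snapshot));
    }
}
```

If two snapshots taken some seconds apart produce equal maps, the group has probably stopped rebalancing; totals that keep dropping to 0 and recovering match the behaviour described here.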

Are there any tunables/configs for Kafka Streams rebalancing?

Questions:

  1. What causes the application to rebalance endlessly while starting (even though there are no errors, exceptions, etc.)?

  2. Are there any tunables that can help us avoid rebalancing?

Recommended answer

Looking at the logs you have added, the consumer pod is starting up, so my guess is that there is a rolling restart of the other 2 pods, and hence a rebalance each time one stops and one starts.

Kafka itself is fast, but a rebalance is not, because the group members have to coordinate during the process. Although partitions may already be assigned to one consumer, the group only starts consuming once all consumers have received their assignment, and a consumer only discovers its assignment inside the poll method (see https://chrisg23.blogspot.com/2020/02/why-is-pausing-kafka-consumer-so.html).

Hence the way to speed up the process is to poll more frequently, so that you hear about changes sooner. There is a trade-off, though: if the topics are not busy in normal running, there will be a lot of spinning doing nothing.

However, it is not quite clear what you mean by endlessly. If you mean that the application is literally doing nothing but rebalancing, then see my comment above. It may be that pods are going up and down continuously (heartbeats dying), or else that processing between polls is taking a long time: are you doing a lot of I/O for each record? Restarts would be obvious from the logs and the pod names. Overly long gaps between polls would also cause warning messages suggesting you either increase max.poll.interval.ms or reduce max.poll.records.
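As a rough illustration of the settings mentioned here, the sketch below collects the poll and heartbeat related configs in a `Properties` object. It uses plain string keys so that it compiles without the Kafka client library. The class name `RebalanceTuning`, the bootstrap address, and every numeric value are assumptions chosen for illustration, not recommendations or values taken from the question.

```java
import java.util.Properties;

public class RebalanceTuning {
    // Builds Kafka Streams settings that influence how easily a rebalance is triggered.
    // All values are illustrative assumptions and should be tuned per workload.
    static Properties streamsProps() {
        Properties p = new Properties();
        p.put("application.id", "app");               // consumer group name used in the question
        p.put("bootstrap.servers", "localhost:9092"); // placeholder address (assumption)
        p.put("num.stream.threads", "2");             // 2 stream threads per pod, as in the question

        // Maximum allowed gap between two poll() calls before the member is
        // considered failed and the group rebalances.
        p.put("max.poll.interval.ms", "600000");
        // Fewer records per poll means less processing time between polls.
        p.put("max.poll.records", "100");
        // Heartbeat settings: a member whose heartbeats stop for longer than
        // session.timeout.ms is removed from the group.
        p.put("session.timeout.ms", "30000");
        p.put("heartbeat.interval.ms", "10000");
        return p;
    }

    public static void main(String[] args) {
        streamsProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

In a real application these properties would be passed to the `KafkaStreams` constructor along with the topology. Raising max.poll.interval.ms only hides slow per-record processing, so it is worth checking the logs for restarts and poll-interval warnings before changing anything.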
