Consumer Stuck in Re-join

Problem description

I've read other threads and worked around the problem by using a new group ID, but I'd like to understand what could cause it.

I have a topic with 16 partitions, and I've set session.timeout.ms=30000 and max.poll.interval.ms=30000000.

I run my program and then kill it with Ctrl+C, so it doesn't shut down properly. After doing this about 16 times (I'm guessing), I get stuck in this re-join issue. session.timeout.ms is the heartbeat timeout, so after 30 seconds the broker should kick my consumer out and free up its partitions, right? Or does it only honor my max.poll.interval.ms?

I still get this error intermittently, and when it happens I have to restart all my consumers. It happens even when the consumers were running fine and no consumers were added or removed: they all suddenly get stuck rejoining. Here's the debug log from a new consumer that I tried to connect while the group was stuck in that state:

https://pastebin.com/AXJeSHkp

2017-06-29 17:28:16,215 DEBUG [AbstractCoordinator] - [scheduler-1] - Sending JoinGroup ((type: JoinGroupRequest, groupId=ingestion-matching-kafka-consumer-group-dev1, sessionTimeout=30000, rebalanceTimeout=43200000, memberId=, protocolType=consumer, groupProtocols=org.apache.kafka.common.requests.JoinGroupRequest$ProtocolMetadata@b45e5583)) to coordinator kafka04-prod01.messagehub.services.us-south.bluemix.net:9093 (id: 2147483644 rack: null)

2017-06-29 17:37:21,261 DEBUG [NetworkClient] - [scheduler-1] - Node 2147483644 disconnected.
2017-06-29 17:37:21,263 DEBUG [ConsumerNetworkClient] - [scheduler-1] - Cancelled JOIN_GROUP request {api_key=11,api_version=1,correlation_id=19,client_id=ingestion-matching-kafka-consumer-dev1} with correlation id 19 due to node 2147483644 being disconnected

Those are the first and last messages that I think are relevant. Here are the relevant timeouts I've set:

session.timeout.ms=30000
max.poll.interval.ms=43200000    
request.timeout.ms=43205000 # the docs said to keep this higher than max.poll.interval.ms
enable.auto.commit=false

Should I set heartbeat.interval.ms too? Is that the interval at which the consumer automatically sends heartbeats to the broker from a background thread? (I've read the docs, but for some reason I can't quite wrap my head around it.)

Recommended answer

I know this is quite an old question, but I had a similar issue, finally understood the reason for it, and want to share.

When a rebalance starts, Kafka waits for every consumer in the group to call poll() and send a JoinGroup request. The rebalance timeout is equal to max.poll.interval.ms, so for each consumer Kafka waits until that consumer finishes its current processing and polls again, or until the rebalance timeout expires.
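To make the waiting rule concrete, here is a rough sketch in plain Java. It is only a simplified model of the behaviour described above, not Kafka's actual coordinator code: the class name, the method name, and the three member timings are all made up for illustration. Each member holds up the rebalance until it polls again or the rebalance timeout expires, and the slowest member decides the total wait.

```java
import java.util.concurrent.TimeUnit;

public class RebalanceWaitSketch {

    /**
     * Simplified model (an assumption, not broker code): for each member the
     * group waits until it rejoins on its next poll() or until the rebalance
     * timeout expires, whichever comes first. The slowest member dominates.
     */
    public static long rebalanceWaitMs(long rebalanceTimeoutMs, long[] msUntilNextPoll) {
        long wait = 0;
        for (long ms : msUntilNextPoll) {
            wait = Math.max(wait, Math.min(ms, rebalanceTimeoutMs));
        }
        return wait;
    }

    public static void main(String[] args) {
        // max.poll.interval.ms from the question: 12 hours.
        long maxPollIntervalMs = 43_200_000L;

        // Hypothetical members: two poll again within seconds,
        // one is in the middle of a 3-hour processing step.
        long[] msUntilNextPoll = {2_000L, 5_000L, TimeUnit.HOURS.toMillis(3)};

        long waitMs = rebalanceWaitMs(maxPollIntervalMs, msUntilNextPoll);
        System.out.println("rebalance blocked for ~"
                + TimeUnit.MILLISECONDS.toMinutes(waitMs) + " minutes");
        // prints "rebalance blocked for ~180 minutes"
    }
}
```

With a small rebalance timeout the same call is capped at that timeout, which is why the size of max.poll.interval.ms matters so much here.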

In your case, max.poll.interval.ms is set to 12 hours (43200000 ms, which is also the rebalanceTimeout in your log). The only sensible reason for that is a long-running processing step. So when a rebalance starts, Kafka waits until that processing finishes or 12 hours pass. That's why your consumers seem stuck.
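On the heartbeat.interval.ms part of the question: as far as I understand the consumer docs, it is the expected time between heartbeats to the group coordinator, it must be lower than session.timeout.ms, and it is usually set to no more than a third of it. In client versions that have max.poll.interval.ms, heartbeats are sent from a background thread. Below is a minimal sketch of a timeout configuration built with plain java.util.Properties. The class and method names are hypothetical, the 5-minute max.poll.interval.ms is just the client default used as an example, and request.timeout.ms is left out because the right value depends on your client version.

```java
import java.util.Properties;

public class ConsumerTimeoutsSketch {

    /**
     * Builds consumer timeout settings. heartbeat.interval.ms is derived as
     * one third of session.timeout.ms, following the usual rule of thumb.
     */
    public static Properties consumerTimeouts(long sessionTimeoutMs, long maxPollIntervalMs) {
        if (sessionTimeoutMs < 3 || maxPollIntervalMs <= 0) {
            throw new IllegalArgumentException("timeouts must be positive");
        }
        Properties props = new Properties();
        props.setProperty("session.timeout.ms", Long.toString(sessionTimeoutMs));
        props.setProperty("heartbeat.interval.ms", Long.toString(sessionTimeoutMs / 3));
        // Also acts as the rebalance timeout, so keep it close to the real
        // worst-case processing time of one poll() batch.
        props.setProperty("max.poll.interval.ms", Long.toString(maxPollIntervalMs));
        props.setProperty("enable.auto.commit", "false");
        return props;
    }

    public static void main(String[] args) {
        // 30 s session timeout as in the question; 5 min poll interval is an example value.
        Properties props = consumerTimeouts(30_000L, 300_000L);
        System.out.println("heartbeat.interval.ms=" + props.getProperty("heartbeat.interval.ms"));
        System.out.println("max.poll.interval.ms=" + props.getProperty("max.poll.interval.ms"));
    }
}
```

If a single batch really can take hours, another option is to keep max.poll.interval.ms small and hand the long work to a separate thread, but whether that fits depends on your processing and commit logic.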
