kafka-python consumer start reading from offset (automatically)
Problem Description
I'm trying to build an application with kafka-python where a consumer reads data from a range of topics. It is extremely important that the consumer never reads the same message twice, but also never misses a message.
Everything seems to be working fine, except when I turn off the consumer (e.g. failure) and try to start reading from offset. I can only read all the messages from the topic (which creates double reads) or listen for new messages only (and miss messages that were emitted during the breakdown). I don't encounter this problem when pausing the consumer.
I created an isolated simulation in order to try to solve the problem.
Here is the generic producer:
from time import sleep
from json import dumps
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['localhost:9092'])

x = 0  # set manually to avoid duplicates
for e in range(1000):
    if e <= x:
        pass
    else:
        data = dumps({'number': e}).encode('utf-8')
        producer.send('numtest', value=data)
        print(e, ' send.')
        sleep(5)
And the consumer. If auto_offset_reset is set to 'earliest', all the messages will be read again. If auto_offset_reset is set to 'latest', no messages emitted during the down-time will be read.
from kafka import KafkaConsumer
from pymongo import MongoClient
from json import loads

## Retrieve data from kafka (WHAT ABOUT MISSED MESSAGES?)
consumer = KafkaConsumer('numtest', bootstrap_servers=['localhost:9092'],
                         auto_offset_reset='earliest', enable_auto_commit=True,
                         auto_commit_interval_ms=1000)

## Connect to database
client = MongoClient('localhost:27017')
collection = client.counttest.counttest

# Store data
for message in consumer:
    message = loads(message.value.decode('utf-8'))
    collection.insert_one(message)
    print('{} added to {}'.format(message, collection))
I feel like the auto-commit isn't working properly.
I know that this question is similar to this one, but I would like a specific solution.
Thank you for helping me.
Recommended Answer
You are getting this behavior because your consumer is not using a Consumer Group. With a Consumer Group, the consumer will regularly commit (save) its position to Kafka. That way if it's restarted it will pick up from its last committed position.
To make your consumer use a Consumer Group, you need to set group_id when constructing it. See the group_id description from the docs:
The name of the consumer group to join for dynamic partition assignment (if enabled), and to use for fetching and committing offsets. If None, auto-partition assignment (via group coordinator) and offset commits are disabled. Default: None
For example:
consumer = KafkaConsumer('numtest', bootstrap_servers=['localhost:9092'],
                         auto_offset_reset='earliest', enable_auto_commit=True,
                         auto_commit_interval_ms=1000, group_id='my-group')
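The asker also wants neither duplicates nor losses. Even with a group_id, auto-commit only gives at-least-once delivery: if the consumer crashes between a database write and the next scheduled commit, the messages processed since the last commit are replayed on restart. A common refinement is to disable auto-commit and commit after each successful write. A minimal sketch under that assumption; decode_record and consume_forever are hypothetical helpers introduced here, and my-group is taken from the example above:

```python
from json import loads

def decode_record(raw):
    """Decode a message produced above with dumps(...).encode('utf-8')."""
    return loads(raw.decode('utf-8'))

def consume_forever(consumer, store):
    """Read messages, store each one, and only then commit the offset,
    so a crash replays at most the messages not yet written."""
    for message in consumer:
        record = decode_record(message.value)
        store(record)          # e.g. collection.insert_one(record)
        consumer.commit()      # persist the position after a successful write

# Against a live broker:
# consumer = KafkaConsumer('numtest', bootstrap_servers=['localhost:9092'],
#                          auto_offset_reset='earliest',
#                          enable_auto_commit=False,   # take control of commits
#                          group_id='my-group')
# consume_forever(consumer, collection.insert_one)
```

A replay of one message is still possible if the process dies between store() and commit(), so the write itself should be idempotent, for example by keying the MongoDB document on the message number and using an upsert instead of insert_one.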