What is the best practice to retry messages from Dead letter Queue for Kafka

Problem description

We are using Kafka as the messaging system between micro-services. We have a Kafka consumer listening to a particular topic and then publishing the data into another topic, to be picked up by a Kafka connector that is responsible for publishing it into some data storage.

We are using Apache Avro as the serialization mechanism.

We need to enable a DLQ to add fault tolerance to the Kafka consumer and the Kafka connector.

Any message could move to the DLQ for multiple reasons:

  1. Bad format
  2. Bad data
  3. Throttling of a high volume of messages, so some messages may move to the DLQ
  4. Publishing to the data storage failed due to connectivity issues.

For points 3 and 4 above, we would like to retry the messages from the DLQ.

What is the best practice for this? Please advise.

Recommended answer

Only push to the DLQ records that cause non-retryable errors, that is: point 1 (bad format) and point 2 (bad data) in your example. For the format of the DLQ records, a good approach is to:

  • Push to the DLQ the exact same Kafka record value and key as the original, and don't wrap it in any kind of envelope. That makes it much easier to re-process with other tools during troubleshooting (e.g. with a new version of the deserializer, etc.); see the sketch after this list.
  • Add a bunch of Kafka headers to communicate metadata about the error; some typical examples:
    • the original topic name, partition, offset and Kafka timestamp of this record
    • the exception or error message
    • the name and version of the application that failed to process the record
    • the time the error occurred
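
For illustration, here is a minimal sketch of such a DLQ publisher using the plain Java client, assuming the consumer works with raw byte[] keys and values (so even records that fail Avro deserialization can be forwarded unchanged). The DLQ topic name, header names, and the application name/version string are made-up placeholders, not a standard:

```java
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class DlqPublisher {

    private final KafkaProducer<byte[], byte[]> producer;
    private final String dlqTopic;

    public DlqPublisher(String bootstrapServers, String dlqTopic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
        this.dlqTopic = dlqTopic;
    }

    /** Forwards a failed record to the DLQ as-is, with error metadata carried in headers. */
    public void publish(ConsumerRecord<byte[], byte[]> failed, Exception error) {
        // Same key and value as the original record: no envelope.
        ProducerRecord<byte[], byte[]> dlqRecord =
                new ProducerRecord<>(dlqTopic, failed.key(), failed.value());

        // Header names below are an illustrative convention, not a standard.
        dlqRecord.headers()
                .add("dlq.original.topic", utf8(failed.topic()))
                .add("dlq.original.partition", utf8(Integer.toString(failed.partition())))
                .add("dlq.original.offset", utf8(Long.toString(failed.offset())))
                .add("dlq.original.timestamp", utf8(Long.toString(failed.timestamp())))
                .add("dlq.error.message", utf8(String.valueOf(error.getMessage())))
                .add("dlq.app", utf8("my-consumer-service:1.2.3")) // hypothetical name:version
                .add("dlq.error.time", utf8(Instant.now().toString()));

        producer.send(dlqRecord);
    }

    private static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

Because the payload stays byte-identical to the original, a troubleshooting tool can later replay it against, say, a fixed deserializer version without any unwrapping step.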

Typically I use one single DLQ topic per service or application (not one per inbound topic, not a shared one across services). That tends to keep things independent and manageable.

Oh, and you probably want to put some monitoring and alerting on the inbound traffic to the DLQ topic ;)
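
As an illustration only, one crude way to watch that inbound traffic is to poll the DLQ topic's end offsets and alert when they advance. The bootstrap server and topic name below are assumptions; a real deployment would more likely rely on broker/JMX metrics or an existing lag exporter:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class DlqTrafficAlert {

    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        String dlqTopic = "my-consumer-service.dlq";        // assumption: your DLQ topic name

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // Discover all partitions of the DLQ topic.
            List<TopicPartition> partitions = consumer.partitionsFor(dlqTopic).stream()
                    .map(pi -> new TopicPartition(pi.topic(), pi.partition()))
                    .collect(Collectors.toList());

            long previous = totalEndOffset(consumer, partitions);
            while (true) {
                Thread.sleep(Duration.ofMinutes(1).toMillis());
                long current = totalEndOffset(consumer, partitions);
                if (current > previous) {
                    // Replace with a real alert channel (metrics counter, pager, chat webhook...).
                    System.err.printf("ALERT: %d new record(s) landed in %s%n",
                            current - previous, dlqTopic);
                }
                previous = current;
            }
        }
    }

    private static long totalEndOffset(KafkaConsumer<byte[], byte[]> consumer,
                                       List<TopicPartition> partitions) {
        return consumer.endOffsets(partitions).values().stream()
                .mapToLong(Long::longValue).sum();
    }
}
```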

Point 3 (high volume) should, IMHO, be dealt with by some sort of auto-scaling, not with a DLQ. Try to always over-estimate (a bit) the number of partitions of the input topic, since the maximum number of instances of your service you can start is limited by that. Too high a volume of messages is not going to overload your service, since Kafka consumers explicitly poll for more messages when they decide to, so they never ask for more than the app can process. If there is a peak of messages, they will simply keep piling up in the upstream Kafka topic.

Point 4 (connectivity) should be retried directly from the source topic, without any DLQ involved, since the error is transient. Dropping the message to a DLQ and picking up the next one is not going to solve anything: the connectivity issue will still be present, and the next message would likely be dropped as well. Reading, or not reading, a record from Kafka does not make it go away, so a record stored there is easy to read again later. You can program your service to move forward to the next inbound record only if it successfully writes the resulting record to the outbound topic (see Kafka transactions: reading a topic actually involves a write operation, since the new consumer offsets need to be persisted, so you can tell your program to persist the new offsets and the output records as part of the same atomic transaction), as sketched below.
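
Here is a minimal sketch of that read-process-write transaction with the plain Java client, assuming String records and made-up topic names (inbound-topic, outbound-topic). Error handling is reduced to abort-and-rewind; a real service would also handle fatal producer errors (e.g. ProducerFencedException) by closing the producer and restarting:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class TransactionalForwarder {

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "forwarder");
        consumerProps.put("enable.auto.commit", "false");        // offsets are committed by the producer
        consumerProps.put("isolation.level", "read_committed");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("transactional.id", "forwarder-tx-1"); // stable id per producer instance
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

            consumer.subscribe(List.of("inbound-topic"));
            producer.initTransactions();

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;

                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> r : records) {
                        // "Processing" here is a pass-through; a real service would transform r.value().
                        producer.send(new ProducerRecord<>("outbound-topic", r.key(), r.value()));
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1));
                    }
                    // Output records and consumer offsets commit atomically, or not at all.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    // Transient failure (e.g. connectivity): abort, rewind, retry the same batch.
                    producer.abortTransaction();
                    for (TopicPartition tp : consumer.assignment()) {
                        OffsetAndMetadata committed =
                                consumer.committed(Collections.singleton(tp)).get(tp);
                        if (committed != null) {
                            consumer.seek(tp, committed.offset());
                        } else {
                            consumer.seekToBeginning(Collections.singleton(tp));
                        }
                    }
                }
            }
        }
    }
}
```

With transactional.id set, the producer is idempotent and fenced across restarts, so a transient connectivity failure just means the same batch is re-read and retried until the write to the outbound topic succeeds.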

Kafka is more like a storage system (with just 2 operations: sequential reads and sequential writes) than a messaging queue; it's good at persistence, data replication, throughput, scale... (...and hype ;) ). It tends to be really good for representing data as a sequence of events, as in "event sourcing". If the needs of this microservice setup are mostly asynchronous point-to-point messaging, and if most scenarios would rather favor super-low latency and choose to drop messages rather than reprocess old ones (as the 4 points listed seem to suggest), maybe a lossy in-memory queuing system like Redis queues is more appropriate?
