Kafka log compaction always shows the last two records of the same key


Problem description

I found these two questions: here and here, but I still don't quite understand. I still get (unexpected?) behaviour.

I am trying to log-compact a Kafka topic using this configuration:

kafka-topics.sh --bootstrap-server localhost:9092 --create --partitions 1 --replication-factor 1 --topic test1 \
  --config "cleanup.policy=compact" \
  --config "delete.retention.ms=1000" \
  --config "segment.ms=1000" \
  --config "min.cleanable.dirty.ratio=0.01" \
  --config "min.compaction.lag.ms=500"
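
To double-check that these per-topic overrides were actually applied, the topic can be described with the same tooling (a sketch, assuming the same broker address as above):

kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic test1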

Then I send these messages, each at least one second apart (see the producer sketch after the list):

A: 3
A: 4
A: 5
B: 10
B: 20
B: 30
B: 40
A: 6
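
One possible way to produce keyed records like these with the stock console producer (a sketch; ':' is assumed as the key separator, so each record is typed as key:value):

kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test1 \
  --property parse.key=true --property key.separator=:

Each line typed into the producer, e.g. A:3, then becomes one keyed record on the topic.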

What I expect is that after a few seconds (1000 ms as configured?), when I run kafka-console-consumer.sh --bootstrap-server localhost:9092 --property print.key=true --topic test1 --from-beginning, I should get:

A: 6
B: 40

Instead, I got:

A: 5
B: 40
A: 6

If I publish another message B: 50 and run the consumer, I get:

B: 40
A: 6
B: 50

instead of the expected:

A: 6
B: 50

  1. How do I actually configure log compaction?
  2. From the Kafka documentation: "Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition."
    Does this mean I can only use log compaction on a topic with a single partition?

Answer

Basically, you already provided the answer yourself. As stated in the Kafka documentation, "log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition". So it is not guaranteed that you will always have exactly one message per key.

If I understand log compaction correctly, it is not meant for the use case you raise in this (perfectly valid) question. Rather, it is meant to eventually reach the stage where only one message per key is present in the topic.

Log compaction is a mechanism to give finer-grained per-record retention, rather than the coarser-grained time-based retention. The idea is to selectively remove records where we have a more recent update with the same primary key. This way the log is guaranteed to have at least the last state for each key.

A compacted topic is the right choice if you plan to keep only the latest state for each key, with the goal of processing as few old states as possible (compared with what you would get from a non-compacted topic, depending on time/size-based retention). As far as I have learned, the use cases for log compaction are rather for keeping the latest address, mobile number, value in a database, and so on: values that do not change every moment and where you usually have many keys.

From a technical perspective, I guess the following happened in your case.

When it comes to compaction, the log is viewed as split into two portions:

  • Clean: Messages that have been compacted before. This section contains only one value for each key, namely the latest value at the time of the previous compaction.
  • Dirty: Messages that were written after the last compaction.

After producing the message B: 40 (A: 5 had already been produced), the clean part of the log is empty and the dirty/active part contains A: 5 and B: 40. The message A: 6 is not yet part of the log at all. Producing the new message A: 6 starts compaction on the dirty part of the log (because your ratio is very low), but excludes the new message itself. As mentioned, there is nothing more to clean, so the new message is simply appended to the topic and now sits in the dirty part of the log. The same happens with B: 50, as you observed.

In addition, compaction will never happen on the active segment. So even though you set segment.ms to just 1000 ms, no new segment is rolled, because no new data comes in after you produce A: 6 or B: 50.

To solve your issue and observe the expected result, you need to produce another message, C: 1, after producing A: 6 or B: 50. That way the cleaner can compare the clean and dirty parts of the log again and will remove A: 5 or B: 40.
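
A minimal sketch of that sequence with the stock console tools (assuming the same broker, topic, and ':' key separator as above):

# produce one more keyed message so the cleaner gets another chance to run
printf 'C:1\n' | kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test1 \
  --property parse.key=true --property key.separator=:

# after a short wait, re-read the topic from the beginning
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test1 \
  --property print.key=true --from-beginning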

In the meantime, watch how the segments behave in Kafka's log directory.
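
A sketch of what that inspection could look like, assuming the default log directory /tmp/kafka-logs (adjust to your broker's log.dirs) and the usual <topic>-<partition> folder layout:

# list the segment files for partition 0 of the topic
ls -l /tmp/kafka-logs/test1-0/

# dump the records still retained in a segment (the first segment is typically named as below)
kafka-dump-log.sh --print-data-log --files /tmp/kafka-logs/test1-0/00000000000000000000.log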

From my perspective, your log-compaction configuration is totally fine! It is simply not the right use case for observing the expected behaviour. For a production use case, however, be aware that your current settings try to start compaction quite frequently. Depending on your data volume, this can become quite I/O intensive. There is a reason the default ratio is 0.50 and log.roll.hours is typically 24 hours. Also, you usually want to make sure consumers get a chance to read all data before it is compacted away.
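
If you later want to move these overrides toward more production-friendly values, something along the following lines should work on recent Kafka versions (a sketch; the exact values are only illustrative):

kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name test1 \
  --alter --add-config min.cleanable.dirty.ratio=0.5,segment.ms=86400000,delete.retention.ms=86400000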
