Data still remains in Kafka topic even after retention time/size


Problem description

We set the log retention to 1 hour as follows (the previous setting was 72 hours).

Using the following Kafka command line tools, we set retention.ms to 1 hour. Our aim is to purge data older than 1 hour from the topic topic_test, so we used the following command:

kafka-configs.sh --alter \
  --zookeeper localhost:2181  \
  --entity-type topics \
  --entity-name topic_test \
  --add-config retention.ms=3600000

And also:

kafka-topics.sh --zookeeper localhost:2181 --alter \
  --topic topic_test \
  --config retention.ms=3600000

Both commands ran without errors.

But the problem is that data older than 1 hour still remains in Kafka!

In fact, no data was removed from the topic_test partitions. We have an HDP Kafka cluster, version 1.0x, with Ambari.

We do not understand why the data on topic topic_test still remains, and has not decreased, even after running both CLI commands described above.

What is wrong with the following Kafka CLI commands?

kafka-configs.sh --alter --zookeeper localhost:2181  --entity-type topics  --entity-name topic_test --add-config retention.ms=3600000

kafka-topics.sh --zookeeper localhost:2181 --alter --topic topic_test --config retention.ms=3600000

From the Kafka server.log we can see the following:

[2020-07-28 14:47:27,394] INFO Processing override for entityPath: topics/topic_test with config: Map(retention.bytes -> 2165441552, retention.ms -> 3600000) (kafka.server.DynamicConfigManager)
[2020-07-28 14:47:27,397] WARN retention.ms for topic topic_test is set to 3600000. It is smaller than message.timestamp.difference.max.ms's value 9223372036854775807. This may result in frequent log rolling. (kafka.server.TopicConfigHandler)

Reference - https://ronnieroller.com/kafka/cheat-sheet

Recommended answer

The log cleaner only works on inactive (sometimes also referred to as "old" or "clean") segments. As long as all data fits into the active ("dirty", "unclean") segment, whose size is defined by the segment.bytes limit, no cleaning will happen.
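This distinction can be sketched in a few lines of Python. The sketch below is purely illustrative (the function and field names are invented, not Kafka internals): only closed segments whose newest record has exceeded retention.ms are candidates for deletion, while the active segment is always kept, no matter how old its data is.

```python
# Illustrative sketch, not Kafka's actual code: a partition's log is a
# list of segments ordered oldest to newest; the final entry is the
# active segment and is never deleted by retention.

RETENTION_MS = 3_600_000  # retention.ms = 1 hour

def deletable_segments(segments, now_ms, retention_ms=RETENTION_MS):
    """Return indices of closed segments the cleaner may delete."""
    closed = segments[:-1]              # the active segment is excluded
    return [i for i, seg in enumerate(closed)
            if now_ms - seg["last_timestamp_ms"] > retention_ms]

now = 10 * 3_600_000                    # pretend "now" is hour 10
log = [
    {"last_timestamp_ms": 1 * 3_600_000},   # 9 hours old -> deletable
    {"last_timestamp_ms": 8 * 3_600_000},   # 2 hours old -> deletable
    {"last_timestamp_ms": 10 * 3_600_000},  # active segment -> kept
]
print(deletable_segments(log, now))     # -> [0, 1]
```

This is why the question's data survived: if everything is still sitting in the one active segment, the list of closed segments is empty and nothing is eligible for deletion.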

The configuration cleanup.policy is described as:

A string that is either "delete" or "compact" or both. This string designates the retention policy to use on old log segments. The default policy ("delete") will discard old segments when their retention time or size limit has been reached. The "compact" setting will enable log compaction on the topic.

In addition, segment.bytes is:

This configuration controls the segment file size for the log. Retention and cleaning is always done a file at a time so a larger segment size means fewer files but less granular control over retention.
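Size-based rolling can be sketched as follows (an assumed simplification, not Kafka's implementation): records append to the active segment, and once a record would push the segment past segment.bytes, a new segment is rolled. The old one becomes closed, which is what finally makes it a candidate for retention-based deletion.

```python
# Illustrative sketch of size-based log rolling. Each segment is
# modelled as a list of byte strings; segment.bytes caps its total size.

SEGMENT_BYTES = 1024  # segment.bytes

def append(segments, record, segment_bytes=SEGMENT_BYTES):
    """Append `record` (bytes) to the log, rolling the active segment on overflow."""
    active = segments[-1]
    if sum(len(r) for r in active) + len(record) > segment_bytes:
        segments.append([])          # roll: start a fresh active segment
        active = segments[-1]
    active.append(record)

log = [[]]
for _ in range(5):
    append(log, b"x" * 400)          # five 400-byte records, 1024-byte segments
print(len(log))                      # -> 3 segments (2 closed, 1 active)
```

With a low write rate and a large segment.bytes, this rollover may never happen on its own, which is where segment.ms comes in.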

The configuration segment.ms can also be used to steer the deletion:

This configuration controls the period of time after which Kafka will force the log to roll even if the segment file isn't full to ensure that retention can delete or compact old data.

As it defaults to one week, you might want to reduce it to fit your needs.
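The effect of lowering segment.ms can be sketched like this (again an illustrative simplification, with invented names): a segment rolls once it has been open longer than segment.ms, even if it is nowhere near segment.bytes.

```python
# Sketch of time-based rolling: with the one-week default, a 2-hour-old
# active segment stays open; after lowering segment.ms to one hour, the
# same segment rolls and its data can then be deleted by retention.

SEGMENT_MS = 3_600_000               # segment.ms lowered to 1 hour
WEEK_MS = 7 * 24 * 3_600_000         # the default value of segment.ms

def should_roll(segment_created_ms, now_ms, segment_ms=SEGMENT_MS):
    """True when the active segment has been open longer than segment.ms."""
    return now_ms - segment_created_ms > segment_ms

print(should_roll(0, 2 * 3_600_000, segment_ms=WEEK_MS))  # -> False
print(should_roll(0, 2 * 3_600_000))                      # -> True
```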

Therefore, if you want to set the retention of a topic to, e.g., one hour, you could set:

cleanup.policy=delete
retention.ms=3600000
segment.ms=3600000
file.delete.delay.ms=1 (The time to wait before deleting a file from the filesystem)
segment.bytes=1024

Note: I am not referring to retention.bytes; segment.bytes is a very different thing, as described above. Also, be aware that log.retention.hours is a cluster-wide configuration. So, if you plan to have different retention times for different topics, this topic-level approach will solve it.
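The cluster-wide vs. topic-level distinction boils down to a precedence rule, sketched here with illustrative names (the dicts are invented; the precedence itself matches Kafka's documented behaviour: a topic-level override shadows the broker-wide default).

```python
# Broker default from log.retention.hours=72, expressed in milliseconds,
# and a per-topic retention.ms override set via kafka-configs.sh.
broker_defaults = {"retention.ms": 72 * 3_600_000}
topic_overrides = {"topic_test": {"retention.ms": 3_600_000}}

def effective_config(topic, key):
    """Topic-level value if present, otherwise the broker-wide default."""
    return topic_overrides.get(topic, {}).get(key, broker_defaults[key])

print(effective_config("topic_test", "retention.ms"))   # -> 3600000
print(effective_config("other_topic", "retention.ms"))  # -> 259200000
```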

