Data still remains in Kafka topic even after retention time/size
Question
We set the log retention hours to 1 hour as follows (the previous setting was 72H).
Using the following Kafka command line tools, we set retention.ms to 1H. Our aim is to purge the data older than 1H in the topic topic_test, so we used the following command:
kafka-configs.sh --alter \
--zookeeper localhost:2181 \
--entity-type topics \
--entity-name topic_test \
--add-config retention.ms=3600000
And also:
kafka-topics.sh --zookeeper localhost:2181 --alter \
--topic topic_test \
--config retention.ms=3600000
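To confirm that the override was actually applied to the topic, the effective topic-level configuration can be inspected (a sketch using the same ZooKeeper-based tooling as above; adjust the ZooKeeper address to your environment):

```shell
# Show the per-topic overrides currently stored for topic_test.
# Expect to see retention.ms=3600000 after the --alter command above.
kafka-configs.sh --describe \
  --zookeeper localhost:2181 \
  --entity-type topics \
  --entity-name topic_test
```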
Both commands ran without errors.
But the problem is that Kafka data older than 1H still remains!
Actually, no data was removed from the topic_test partitions. We have an HDP Kafka cluster, version 1.0x, with Ambari.
We do not understand why the data on topic topic_test still remains and has not decreased, even after we ran both CLI commands as described.
What is wrong with the following Kafka CLI commands?
kafka-configs.sh --alter --zookeeper localhost:2181 --entity-type topics --entity-name topic_test --add-config retention.ms=3600000
kafka-topics.sh --zookeeper localhost:2181 --alter --topic topic_test --config retention.ms=3600000
From the Kafka server.log we can see the following:
[2020-07-28 14:47:27,394] INFO Processing override for entityPath: topics/topic_test with config: Map(retention.bytes -> 2165441552, retention.ms -> 3600000) (kafka.server.DynamicConfigManager)
[2020-07-28 14:47:27,397] WARN retention.ms for topic topic_test is set to 3600000. It is smaller than message.timestamp.difference.max.ms's value 9223372036854775807. This may result in frequent log rolling. (kafka.server.TopicConfigHandler)
Reference - https://ronnieroller.com/kafka/cheat-sheet
Answer
The log cleaner will only work on inactive (sometimes also referred to as "old" or "clean") segments. As long as all data fits into the active ("dirty", "unclean") segment, whose size is defined by the segment.bytes limit, no cleaning will happen.
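One way to see whether the data is still sitting in the active segment is to list the segment files for a partition on a broker (a sketch; the data directory below is an assumption, check log.dirs in server.properties for the actual path):

```shell
# List segment files for partition 0 of topic_test.
# /kafka-logs is an assumed log.dirs value - verify it in server.properties.
ls -lh /kafka-logs/topic_test-0/
# The .log file with the highest base offset is the active segment;
# only older, closed segments are eligible for deletion by retention.
```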
The configuration cleanup.policy is described as:

A string that is either "delete" or "compact" or both. This string designates the retention policy to use on old log segments. The default policy ("delete") will discard old segments when their retention time or size limit has been reached. The "compact" setting will enable log compaction on the topic.
Additionally, segment.bytes is:

This configuration controls the segment file size for the log. Retention and cleaning is always done a file at a time, so a larger segment size means fewer files but less granular control over retention.
The configuration segment.ms can also be used to steer the deletion:

This configuration controls the period of time after which Kafka will force the log to roll even if the segment file isn't full, to ensure that retention can delete or compact old data.
As it defaults to one week, you might want to reduce it to fit your needs.
Therefore, if you want to set the retention of a topic to, e.g., one hour, you could set:
cleanup.policy=delete
retention.ms=3600000
segment.ms=3600000
file.delete.delay.ms=1 (The time to wait before deleting a file from the filesystem)
segment.bytes=1024
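The settings above can be applied in a single kafka-configs.sh call (a sketch using the same ZooKeeper-based tooling as in the question; on newer Kafka versions --bootstrap-server replaces --zookeeper):

```shell
# Apply all retention-related overrides to topic_test at once.
# segment.bytes=1024 is deliberately tiny here so segments roll and
# become eligible for deletion quickly; do not use it in production.
kafka-configs.sh --alter \
  --zookeeper localhost:2181 \
  --entity-type topics \
  --entity-name topic_test \
  --add-config cleanup.policy=delete,retention.ms=3600000,segment.ms=3600000,file.delete.delay.ms=1,segment.bytes=1024
```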
Note: I am not referring to retention.bytes; segment.bytes is a very different thing, as described above. Also, be aware that log.retention.hours is a cluster-wide configuration. So, if you plan to have different retention times for different topics, this will solve it.