Kafka Streams deleting consumed repartition records, to reduce disk usage


Question

We have a Kafka instance with about 50M records and about 100k new records per day, so nothing crazy in the Kafka world. When we want to reprocess these records with one of our more complex stream apps (with many different aggregation steps), disk usage gets pretty crazy because of the repartition topics. From what we understand, these topics use the standard retention time (14 days?) in kafka-streams 1.0.1 and Long.MAX_VALUE in 2.1.1. This is very inconvenient because, in our case, each record in a repartition topic is read only once during the aggregation and can be deleted afterwards.

So our question is: is there any way to configure a setting in kafka-streams that purges records after they have been processed? I have seen that there is some way to do this with purgeDataBefore() (https://issues.apache.org/jira/browse/KAFKA-4586).
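For what it's worth, the closest existing knob we have found is the "topic." config prefix in StreamsConfig (KIP-173, available since 1.0), which passes broker-side topic configs to the internal topics Streams creates. Below is a minimal sketch of capping retention that way; the application id and broker address are hypothetical, and note the caveat that records not yet consumed within the shortened retention would be lost:

import java.util.Properties;
import org.apache.kafka.common.config.TopicConfig;
import org.apache.kafka.streams.StreamsConfig;

public class InternalTopicRetention {

    public static Properties props() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-reprocessing-app"); // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // hypothetical

        // Configs with the "topic." prefix are applied to the internal
        // (repartition and changelog) topics that Streams creates.
        // Here: cap retention at 24 hours instead of the default.
        props.put(StreamsConfig.topicPrefix(TopicConfig.RETENTION_MS_CONFIG),
                String.valueOf(24 * 60 * 60 * 1000L));
        return props;
    }
}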

For reference, some sizes in one part of the app:

table-1 (changelog, compact, ~2 GB) --> change key and aggregate (repartition, ~14 GB) --> table-2 (changelog, delete, 14 KB) --> change key and aggregate (repartition, 21 GB) --> table-3 (changelog, compact, 0.5 GB)
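To make the pipeline above concrete, here is a minimal sketch of what such a two-step change-key-and-aggregate topology looks like. All topic names, serdes, and the re-keying/aggregation logic are hypothetical stand-ins, not our actual code; each groupBy on a new key is what forces a repartition topic:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class TwoStepAggregation {

    public static void build(StreamsBuilder builder) {
        // table-1: the input table (backed by a compacted changelog, ~2 GB)
        KTable<String, Long> table1 = builder.table("table-1-input",
                Consumed.with(Serdes.String(), Serdes.Long()));

        // Step 1: change the key and aggregate. The groupBy on a new key
        // forces the first repartition topic (~14 GB above).
        KTable<String, Long> table2 = table1
                .groupBy((k, v) -> KeyValue.pair(firstNewKey(k), v),
                        Grouped.with(Serdes.String(), Serdes.Long()))
                .reduce(Long::sum, (agg, oldVal) -> agg - oldVal); // adder / subtractor

        // Step 2: change the key again and aggregate once more, forcing
        // the second repartition topic (~21 GB above); result is table-3.
        KTable<String, Long> table3 = table2
                .groupBy((k, v) -> KeyValue.pair(secondNewKey(k), v),
                        Grouped.with(Serdes.String(), Serdes.Long()))
                .reduce(Long::sum, (agg, oldVal) -> agg - oldVal);

        table3.toStream().to("table-3-output",
                Produced.with(Serdes.String(), Serdes.Long()));
    }

    // Hypothetical re-keying functions standing in for the real business logic.
    private static String firstNewKey(String key)  { return key.substring(0, 1); }
    private static String secondNewKey(String key) { return key.toLowerCase(); }
}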

(This is my first Stack Overflow question, so any feedback is appreciated. Thanks in advance!)

Answer

Kafka Streams has used the purgeDataBefore() API since the 1.1 release: https://issues.apache.org/jira/browse/KAFKA-6150

You don't need to enable it (and you cannot disable it either).
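Under the hood this relies on the broker-side record deletion added in KIP-107: since 1.1, the Streams runtime issues the equivalent of the admin client's deleteRecords() call against its repartition topics once offsets are committed. Below is a minimal sketch of the same call done manually; the topic name, partition, and offset are hypothetical, and on 1.1+ you do not need to do this yourself:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.common.TopicPartition;

public class ManualPurge {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical repartition topic, partition 0: delete every
            // record below offset 123456 (i.e. everything already consumed).
            TopicPartition tp = new TopicPartition(
                    "my-app-KSTREAM-AGGREGATE-STATE-STORE-0000000003-repartition", 0);
            Map<TopicPartition, RecordsToDelete> toDelete =
                    Collections.singletonMap(tp, RecordsToDelete.beforeOffset(123456L));

            // Same broker-side mechanism (KIP-107) that Streams invokes
            // automatically for its repartition topics since 1.1.
            admin.deleteRecords(toDelete).all().get();
        }
    }
}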

