Kafka Streams deleting consumed repartition records, to reduce disk usage


Problem Description

We have a Kafka instance with about 50M records and about 100k inputs per day, so nothing crazy in the Kafka world. When we want to reprocess these records with one of our more complex stream apps (with many different steps of aggregation), the disk usage gets pretty crazy because of the repartition topics. From what we have understood, these topics use the standard retention time (14 days?) in kafka-streams 1.0.1 and Long.MAX in 2.1.1. This is very inconvenient because, for the repartition topics in our case, each record is read only once (when the aggregation is done) and can be deleted afterwards.
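For illustration, here is a minimal sketch of one workaround: overriding retention on the internal (repartition and changelog) topics through a `topic.`-prefixed StreamsConfig setting. This assumes you accept that lagging repartition data may expire before it is read; the application id and broker address are placeholders.

```java
import org.apache.kafka.common.config.TopicConfig;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class InternalTopicRetentionConfig {

    public static Properties build() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-aggregation-app"); // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker

        // Configs prefixed with "topic." are applied to the internal topics
        // (repartition and changelog) that Kafka Streams creates itself.
        // Capping retention at one day limits how much repartition data piles up,
        // but records not processed within that window are lost.
        props.put(StreamsConfig.topicPrefix(TopicConfig.RETENTION_MS_CONFIG),
                  String.valueOf(24L * 60 * 60 * 1000));
        return props;
    }
}
```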

So our question is whether there is any way to configure a setting in kafka-streams that purges records after they have been processed. I have seen that there is some way to do this with purgeDataBefore() (https://issues.apache.org/jira/browse/KAFKA-4586).
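That JIRA (KIP-107) is exposed through the admin client's deleteRecords() call. Below is a minimal sketch of invoking it by hand, assuming a hypothetical repartition topic name, partition, and committed offset; in practice the offset would come from the Streams application's consumer group.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.common.TopicPartition;

import java.util.Collections;
import java.util.Properties;

public class ManualRepartitionPurge {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical internal topic name; the offset should be the lowest offset
            // the Streams app has already committed for this partition.
            TopicPartition partition = new TopicPartition(
                    "my-aggregation-app-KSTREAM-AGGREGATE-STATE-STORE-0000000003-repartition", 0);
            long committedOffset = 1_000_000L;

            // Everything below the given offset is deleted and the broker advances
            // the log start offset, which frees the disk space.
            admin.deleteRecords(
                    Collections.singletonMap(partition, RecordsToDelete.beforeOffset(committedOffset)))
                 .all()
                 .get();
        }
    }
}
```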

For reference, some sizes in a part of the app:

table-1 (changelog, compact ~ 2GB) --> change key and aggregate (repartition ~ 14GB) --> table-2 (changelog, delete, 14KB) --> change key and aggregate (repartition 21GB) --> table-3 (changelog, compact, 0.5GB)
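To make the shape of such a pipeline concrete, here is a rough sketch of a comparable topology; the topic names, re-keying logic, and the use of count() as the aggregation are illustrative stand-ins, not the actual application. Each groupBy() changes the key and therefore writes through a repartition topic, and each aggregation is backed by a changelog topic.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class TopologySketch {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // table-1: materialized from the input topic, backed by a compacted changelog
        KTable<String, String> table1 = builder.table(
                "input-topic", Consumed.with(Serdes.String(), Serdes.String()));

        // change key + aggregate: groupBy() re-keys the records, which forces a
        // repartition topic; the aggregation becomes table-2 with its own changelog
        KTable<String, Long> table2 = table1
                .groupBy((key, value) -> KeyValue.pair(value, value),
                         Grouped.with(Serdes.String(), Serdes.String()))
                .count(Materialized.as("table-2-store"));

        // second re-key + aggregate: another repartition topic, then table-3
        // (re-keying by the first character is purely illustrative)
        KTable<String, Long> table3 = table2
                .groupBy((key, count) -> KeyValue.pair(key.substring(0, 1), count),
                         Grouped.with(Serdes.String(), Serdes.Long()))
                .count(Materialized.as("table-3-store"));

        table3.toStream().to("output-topic");
        return builder.build();
    }
}
```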

(This is my first Stack Overflow question, so any feedback is appreciated. Thanks in advance!)

Answer

Kafka Streams has used the purgeDataBefore() API since the 1.1 release: https://issues.apache.org/jira/browse/KAFKA-6150

You don't need to enable it (and you cannot disable it either).
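If you want to confirm that the purging is actually happening, one way (the topic name and broker address below are placeholders) is to watch the log start offset of a repartition topic: when Kafka Streams deletes consumed records, the earliest available offset keeps moving forward.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class LogStartOffsetCheck {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        // Hypothetical repartition topic; list the application's internal topics to find the real name.
        TopicPartition partition = new TopicPartition(
                "my-aggregation-app-KSTREAM-AGGREGATE-STATE-STORE-0000000003-repartition", 0);

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // beginningOffsets() returns the log start offset; a growing value means
            // already-consumed repartition records are being deleted.
            Map<TopicPartition, Long> earliest = consumer.beginningOffsets(Collections.singleton(partition));
            System.out.println("log start offset: " + earliest.get(partition));
        }
    }
}
```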
