Kafka retention policies

Problem description

Assume that I have a multi-broker Kafka setup (all brokers running on the same host) with 3 brokers and 50 topics, each of which is configured to have 7 partitions and a replication factor of 3.

I have 50GB of disk space to spend on Kafka and want to make sure the Kafka logs never exceed this amount, so I want to configure my retention policy to prevent that scenario.

I have set up a delete cleanup policy:

log.cleaner.enable=true
log.cleanup.policy=delete

and I need to configure the following properties so that data is deleted on a weekly basis and I never run out of disk space:

log.retention.hours
log.retention.bytes
log.segment.bytes
log.retention.check.interval.ms
log.roll.hours

These topics contain data streamed from tables in a database with a total size of about 10GB (but inserts, updates, and deletes are constantly being streamed into these topics).

How should I go about configuring the aforementioned parameters so that data is removed every 7 days, while also making sure that data can be deleted over a shorter window if needed, so that I never run out of disk space?

Recommended answer

The time-based retention is easy: just set it to whatever you need.
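
For the 7-day requirement in the question, 168 hours happens to be the broker default, so a minimal sketch looks like this (the equivalent per-topic override is retention.ms):

log.retention.hours=168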

Size-based retention, on the other hand, is not trivial, for several reasons:

  1. The retention limits are minimum guarantees. This means that if you set log.retention.bytes to 1GB, you will always have at least 1GB of data available on disk. This does not cover the maximum size on disk the partition can take, only the lower bound.

  2. The log cleaner only runs periodically (every 5 minutes by default), so in the worst-case scenario you could end up with 1GB plus the amount of data that can be written in 5 minutes. Depending on your environment, that can be a lot of data (see the example after this list).

  3. In addition to the partition's data, Kafka writes a few more files (mostly indexes) to disk. While these files are usually small (10MB by default), you may have to take them into account.
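
Regarding point 2: the 5-minute cadence is the broker's log.retention.check.interval.ms, which defaults to 300000 ms and is one of the properties listed in the question. Lowering it reduces the worst-case overshoot at the cost of more frequent retention checks; the value below is purely illustrative:

log.retention.check.interval.ms=60000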

Ignoring the indexes, one decent heuristic you can use to estimate the maximum disk size of a partition is:

SIZE = segment.bytes + retention.bytes

In a normal environment it is rare for all partitions to exceed their limits at the same time, so it's usually possible to ignore the second point.

If you want to count the indexes, then you also need to add segment.index.bytes twice per segment (there are 2 indexes: offset and timestamp).
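
Folding the indexes into the heuristic gives a rough per-partition upper bound (this is an approximation: index files are preallocated at segment.index.bytes, 10MB by default, and the number of segments is taken as retention.bytes / segment.bytes plus one active segment):

SIZE ≈ segment.bytes + retention.bytes + 2 × segment.index.bytes × (retention.bytes / segment.bytes + 1)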

With 3 brokers and 3 replicas, each broker will host 350 partitions. It's also probably safer to include a "fudge factor" as Kafka does not like full disks! So subtract 5-10% from your total disk size, especially if you don't count the indexes.
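
To make this concrete for the setup in the question, here is a rough worked sketch based on the heuristic above; the concrete values are illustrative assumptions, not recommendations:

# 50 topics x 7 partitions x 3 replicas = 1050 partition replicas, i.e. ~350 per broker
# 50GB minus a ~10% fudge factor leaves ~45GB, so roughly 45GB / 350 ≈ 130MB per partition
# With SIZE = segment.bytes + retention.bytes, a 64MB/64MB split stays within that budget:
log.segment.bytes=67108864
log.retention.bytes=67108864
log.retention.hours=168

Since log.retention.bytes is a per-partition limit, the worst case per broker is then roughly 350 × 128MB ≈ 44GB, leaving some headroom under the 50GB budget. Note that smaller segments mean more files and more frequent rolls, so this is a trade-off rather than a free win.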

With all these gotchas in mind, you should be able to find the log size you need.
