Kafka retention policies

Problem description

Assume that I have a multi-broker Kafka setup (running on the same host) with 3 brokers and 50 topics, each of which is configured to have 7 partitions and a replication factor of 3.

I have 50GB of memory to spend on Kafka and want to make sure that the Kafka logs will never exceed this amount, so I want to configure my retention policy to prevent that scenario.

I have set up a delete cleanup policy:

log.cleaner.enable=true
log.cleanup.policy=delete

and need to configure the following properties so that the data is deleted on a weekly basis and I will never run out of memory:

log.retention.hours
log.retention.bytes
log.segment.bytes
log.retention.check.interval.ms
log.roll.hours

These topics contain data streamed from tables in a database with a total size of about 10GB (but inserts, updates, and deletes are constantly streamed into these topics).

How should I go about configuring the aforementioned parameters so that data is removed every 7 days, while making sure that data can be deleted in a shorter window if needed so that I won't run out of memory?

Recommended answer

Regarding the time retention, it's easy: just set it to what you need.
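
For example, assuming the goal is to keep data for one week, the time-based side could simply be:

log.retention.hours=168

(168 hours = 7 days; log.retention.ms or log.retention.minutes work as well if you need finer granularity.)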

For the size retention, this is not trivial for several reasons:

  1. The retention limits are minimum guarantees. This means that if you set log.retention.bytes to 1GB, you will always have at least 1GB of data available on disk. This does not cap the maximum size on disk the partition can take; it is only the lower bound.

  2. The log cleaner only runs periodically (every 5 minutes by default), so in the worst case you could end up with 1GB plus the amount of data that can be written in 5 minutes. Depending on your environment, that can be a lot of data (one way to tighten this is sketched after this list).

  3. In addition to the partition's data, Kafka writes a few more files (mostly indexes) to disk. While these files are usually small (10MB by default), you may have to take them into account.
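
If the worst-case overshoot from point 2 is a concern, one optional knob (a sketch, not a requirement) is to have the broker check for deletable segments more often than the 5-minute default:

log.retention.check.interval.ms=60000

This evaluates retention every minute instead of every 5, trading a little extra background work for a tighter bound on how far a partition can overshoot its limit.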

Ignoring the indexes, one decent heuristic you can use to estimate the max disk size of a partition is:

SIZE = segment.bytes + retention.bytes
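
As a rough illustration with hypothetical values (not a recommendation), take:

log.retention.bytes=1073741824
log.segment.bytes=1073741824

(1GB each; 1GB is in fact the default for segment.bytes.) Then SIZE ≈ 1GB + 1GB = 2GB per partition, before counting anything that accumulates ahead of the next retention check (point 2) or the index files (point 3).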

In a normal environment it's rare for all partitions to exceed their limits at the same time, so it's usually possible to ignore the second point.

If you want to count the indexes, then you also need to add segment.index.bytes twice (there are 2 indexes: offset and timestamp) for each segment.

With 3 brokers and a replication factor of 3, each broker will host 350 partitions. It's also probably safer to include a "fudge factor", as Kafka does not like full disks! So subtract 5-10% from your total disk size, especially if you don't count the indexes.
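
To make that concrete for the setup in the question (treat the arithmetic as a sketch): 50 topics × 7 partitions × 3 replicas = 1050 partition replicas, or 350 per broker across 3 brokers. With a 50GB budget and a ~10% fudge factor you have roughly 45GB to work with, i.e. about 45GB / 350 ≈ 130MB per partition. So segment.bytes + retention.bytes should stay comfortably below ~130MB, and lower still if you also count the index files.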

With all these gotchas in mind, you should be able to find the log size you need.
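
Putting it together, one possible configuration for this scenario (hypothetical values, to be tuned against the actual write rate) might look like:

log.cleanup.policy=delete
log.retention.hours=168
log.retention.bytes=94371840
log.segment.bytes=33554432
log.retention.check.interval.ms=300000

Here retention.bytes (90MB) plus segment.bytes (32MB) stays under the ~130MB per-partition budget estimated above, the time limit still enforces the weekly deletion, and the check interval is left at its 5-minute default (lower it if the overshoot discussed earlier matters to you).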
