Why is my Kafka tmp folder almost the same size as my disk?


Problem description

I am building a production Kafka environment with this layout: 3 ZooKeeper servers, 3 Kafka brokers, and two Kafka Connect workers. I put my tmp folder side by side with my Kafka main folder, and I run everything on a remote Ubuntu host, not in Docker.

While operating Kafka I ran into an error telling me that too much disk was being consumed. I checked my Kafka tmp folder and found it had grown to almost 2/3 of my disk size, which brought my Kafka cluster down.

I inspected each Kafka log folder and found this:

  1. 25 connect_offset partitions from worker no. 1, about 21 MB each
  2. 25 connect_offset2 partitions from worker no. 2, about 21 MB each
  3. 25 connect_status partitions from worker no. 1, about 21 MB each
  4. 25 connect_status2 partitions from worker no. 2, about 21 MB each
  5. 50 __consumer_offsets partitions from both workers, about 21 MB each
  6. topic partitions at about 21 MB each; I have 2 topics, so 6 topic-partition folders
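If you want to reproduce this kind of per-partition breakdown, a du sweep over the Kafka log directory is enough. A minimal sketch, assuming the log.dir path /home/xxx/tmp/kafka_log1 from the broker config below:

# Disk usage per partition folder, largest first
du -sh /home/xxx/tmp/kafka_log1/* | sort -rh | head -50

# Combined size of all __consumer_offsets partitions
du -ch /home/xxx/tmp/kafka_log1/__consumer_offsets-* | tail -1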

The problem is that __consumer_offsets consumes more disk than the other offsets, and my Kafka configuration cannot handle it. Here is my broker configuration:

broker.id=101
port=9099
listeners=PLAINTEXT://0.0.0.0:9099
advertised.listeners=PLAINTEXT://127.0.0.1:9099
num.partitions=3
offsets.topic.replication.factor=3
log.dir=/home/xxx/tmp/kafka_log1
log.cleaner.enable=true
log.cleanup.policy=delete
log.retention.bytes=1073741824
log.segment.bytes=1073741824
log.retention.check.interval.ms=60000
message.max.bytes=1073741824
zookeeper.connect=xxx:2185,xxx:2186,xxx:2187
zookeeper.connection.timeout.ms=7200000
session.time.out.ms=30000
delete.topic.enable=true

And for each topic, this is the creation command:

kafka-topics.sh --create --zookeeper xxx:2185,xxx:2186,xxx:2187 --replication-factor 3 --partitions 3 --topic $topic_name --config cleanup.policy=delete --config retention.ms=86400000 --config min.insync.replicas=2 --config compression.type=gzip
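As a sanity check after creation, the same tool can describe the topic and confirm the partition count and config overrides took effect (a usage sketch, reusing the ZooKeeper endpoints from above):

kafka-topics.sh --describe --zookeeper xxx:2185,xxx:2186,xxx:2187 --topic $topic_name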

And the Connect worker config looks like this (the two workers share an identical config except for the port and the offset/status topic names):

bootstrap.servers=XXX:9099,XXX:9098,XXX:9097
group.id=XXX
key.converter.schemas.enable=true
value.converter.schemas.enable=true
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.topic=connect-offsets
offset.storage.replication.factor=3
config.storage.topic=connect-configs
config.storage.replication.factor=3
status.storage.topic=connect-status
status.storage.replication.factor=3
offset.flush.timeout.ms=300000
rest.host.name=xxx
rest.port=8090
connector.client.config.override.policy=All
producer.max.request.size=1073741824
producer.acks=all
producer.enable.idempotence=true
consumer.max.partition.fetch.bytes=1073741824
consumer.auto.offset.reset=latest
consumer.enable.auto.commit=true
consumer.max.poll.interval.ms=5000000
plugin.path=/xxx/connectors

According to several pieces of documentation, it seems obvious that Kafka doesn't need a lot of disk space (the largest recorded tmp folder I have seen mentioned is 36 GB).

Recommended answer

What do you mean "@21 MB"? Your log.segment.bytes is set to 1 GB...

First, never use /tmp for persistent storage. And don't use /home for server data. Always use a separate partition/disk for server data, as well as for /var + /var/logs.
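A minimal sketch of what that looks like in server.properties; the mount point /var/lib/kafka/data is an assumed example, not something from the question:

# Point the broker at a dedicated data partition/disk
# instead of /home/xxx/tmp/kafka_log1
log.dirs=/var/lib/kafka/data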

Second, you have two Connect clusters. Use the same 3 topics and the same group.id, and you have one distributed cluster, saving yourself from having 3 extra topics.
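Concretely, that means both worker property files share the same group.id and the same three storage topics, and only host-local settings differ. A sketch under that assumption (the group.id value here is a made-up example):

# Identical on both Connect workers -> one distributed cluster
group.id=connect-cluster
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status

# Per-worker settings that may differ per host
rest.host.name=xxx
rest.port=8090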

Finally,

__consumer_offsets consumes more disk than the other offsets

Well, yes. All consumer groups store their offsets there. It will be by far your largest internal topic, depending on your offsets.retention.minutes.
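To see how much disk __consumer_offsets actually holds on each broker, and to bound how long committed offsets are kept, something like the following can help (a sketch; 10080 minutes = 7 days is only an example value, not a recommendation):

# Per-broker log sizes for the internal offsets topic
kafka-log-dirs.sh --describe --bootstrap-server XXX:9099 --topic-list __consumer_offsets

# server.properties: retention for committed consumer offsets
offsets.retention.minutes=10080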

Kafka doesn't need a lot of disk space

It doesn't, when you are just getting started.

I've seen clusters with tens to hundreds of TB of storage.

If you watch Kafka Summit talks from large companies, they are sending GBs of events per second (see Netflix, Spotify, Uber, etc.).

  1. Apache
  2. Confluent
