How to find the root cause of high CPU usage of Kafka brokers?


Problem description


I am in charge of operating two Kafka clusters (one for prod and one for our dev environment). The setup is mostly similar, but the dev environment has no SASL/SSL setup and uses just 4 brokers instead of 8. Each broker is assigned to a dedicated Google Kubernetes node with 4 vCPUs and 26 GB RAM.

In our dev environment we get roughly 1000 messages in / sec, and each of the 4 brokers quite consistently uses 3 of the 4 available CPU cores (75% CPU usage).

In our prod environment we get about 1500 messages in / sec, and CPU usage there is also 3 out of 4 cores.

It seems that CPU is at the very least a bottleneck for us, and I'd like to know how I can perform CPU profiling so that I know exactly what is causing the high CPU usage. Since it's relatively consistent, my guess is that it could be our Snappy compression.

I am interested in any ideas on how I could investigate the cause of the high CPU usage and how I could tune this in my cluster.

  • Apache Kafka version: 2.1 (CPU load used to be similar on Kafka 0.11.x too)

  • Dev cluster (Snappy compression, no SASL/SSL, 4 brokers): 1000 messages in / sec, consistently 3 CPU cores in use

  • Prod cluster (Snappy compression, SASL/SSL, 8 brokers): 1500 messages in / sec, consistently 3 CPU cores in use

Side note: I have already made sure that the producers send their messages Snappy-compressed. I have access to all JMX metrics, but couldn't find anything useful for figuring out the CPU usage.

I already have metrics attached to my Prometheus (that is also where I got the CPU usage stats from). The problem is that the container's CPU usage doesn't tell me WHY it is that high. I need more granularity, e.g. what CPU cycles are being spent on (compression? broker communication? SASL/SSL?).

Solution

If you have access to JMX metrics, you are almost done profiling the CPU. All you have to do is install Prometheus and Grafana, then store the metrics in Prometheus and monitor them with Grafana. You can find the complete steps in Monitoring Kafka.
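If you need finer granularity than container-level CPU, the same JMX connection also exposes per-thread CPU times, which usually tells you which broker subsystem (network threads, request handler threads, log cleaner, replica fetchers) is burning the cycles. Below is a minimal sketch, assuming remote JMX is enabled on the broker (e.g. by setting JMX_PORT for the Kafka start scripts); the hostname kafka-broker-0, the port 9999 and the class name are placeholders of mine:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerThreadCpu {

    public static void main(String[] args) throws Exception {
        // Placeholder endpoint: adjust host and port; the broker JVM must be
        // running with remote JMX enabled (e.g. JMX_PORT=9999).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://kafka-broker-0:9999/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

            // First sample of per-thread CPU time in nanoseconds (-1 means
            // CPU time measurement is unavailable for that thread).
            Map<Long, Long> first = new HashMap<>();
            for (long id : threads.getAllThreadIds()) {
                first.put(id, threads.getThreadCpuTime(id));
            }

            final long intervalMs = 10_000;
            Thread.sleep(intervalMs);

            // Second sample: compute deltas and report the busiest threads.
            Map<String, Long> deltas = new HashMap<>();
            for (long id : threads.getAllThreadIds()) {
                Long before = first.get(id);
                long after = threads.getThreadCpuTime(id);
                ThreadInfo info = threads.getThreadInfo(id);
                if (before == null || before < 0 || after < 0 || info == null) {
                    continue;
                }
                deltas.merge(info.getThreadName(), after - before, Long::sum);
            }
            deltas.entrySet().stream()
                    .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                    .limit(15)
                    .forEach(e -> System.out.printf("%-55s %5.1f%% of one core%n",
                            e.getKey(), 100.0 * e.getValue() / (intervalMs * 1_000_000L)));
        }
    }
}

Since the SASL/SSL work runs on the network threads and broker-side (re)compression on the request handler threads, this per-thread breakdown should map fairly directly onto the suspects listed in the question (compression? broker communication? SASL/SSL?).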

Note: if you are suspicious about Snappy compression, maybe this performance test can help you.
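To put a number on that suspicion yourself, a quick single-core throughput check gives an upper bound on what Snappy alone could cost. Here is a minimal sketch using the snappy-java library (org.xerial.snappy:snappy-java, the binding Kafka itself depends on); the 1 KiB semi-compressible payload and the class name are assumptions of mine, so feed it a sample of your real messages for meaningful numbers:

import java.io.IOException;
import java.util.Arrays;
import java.util.Random;
import org.xerial.snappy.Snappy; // from org.xerial.snappy:snappy-java

public class SnappyThroughputCheck {

    public static void main(String[] args) throws IOException {
        // Assumed payload: 1 KiB, semi-compressible. Substitute a sample of
        // your real messages -- compression cost depends heavily on content.
        byte[] message = sampleMessage(1024);
        int rounds = 200_000;

        // Warm-up so the JIT has compiled the hot path before we time it.
        for (int i = 0; i < 20_000; i++) {
            Snappy.compress(message);
        }

        long start = System.nanoTime();
        long compressedBytes = 0;
        for (int i = 0; i < rounds; i++) {
            compressedBytes += Snappy.compress(message).length;
        }
        double seconds = (System.nanoTime() - start) / 1e9;

        System.out.printf("~%.0f messages/sec on one core, compression ratio %.2f%n",
                rounds / seconds,
                (double) compressedBytes / ((long) rounds * message.length));
    }

    private static byte[] sampleMessage(int size) {
        // Random bytes interleaved with repeated runs, so Snappy finds matches.
        byte[] data = new byte[size];
        new Random(42).nextBytes(data);
        for (int i = 0; i + 32 <= size; i += 64) {
            Arrays.fill(data, i, i + 32, (byte) 'x');
        }
        return data;
    }
}

If one core turns out to compress orders of magnitude more messages per second than your actual 1000-1500 messages in / sec, compression alone is unlikely to explain three busy cores.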

Update:

According to Confluent, most of the CPU usage is because of SSL (a rough way to gauge the raw cipher cost on your own hardware is sketched after the quote below):

Note that if SSL is enabled, the CPU requirements can be significantly higher (the exact details depend on the CPU type and JVM implementation).

You should choose a modern processor with multiple cores. Common clusters utilize 24 core machines.

If you need to choose between faster CPUs or more cores, choose more cores. The extra concurrency that multiple cores offers will far outweigh a slightly faster clock speed.
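To gauge what the raw cipher work behind SSL costs on your hardware, here is a rough, hedged microbenchmark timing AES-128-GCM (a common TLS bulk cipher) on one core; the class name is mine. Treat the result as a lower bound on real TLS overhead, which adds handshakes, record framing and extra buffer copies, and note that the cipher suite actually negotiated may differ:

import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class CipherThroughputCheck {

    public static void main(String[] args) throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        byte[] payload = new byte[16 * 1024]; // roughly one full-size TLS record
        new SecureRandom().nextBytes(payload);
        byte[] iv = new byte[12];
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");

        // Warm-up so the JIT and the AES intrinsics kick in before we time.
        int counter = 0;
        for (int i = 0; i < 5_000; i++) {
            encryptOnce(cipher, key, iv, counter++, payload);
        }

        int rounds = 50_000;
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            encryptOnce(cipher, key, iv, counter++, payload);
        }
        double seconds = (System.nanoTime() - start) / 1e9;

        System.out.printf("AES-128-GCM: ~%.0f MiB/s on one core%n",
                (double) rounds * payload.length / (1024 * 1024) / seconds);
    }

    private static void encryptOnce(Cipher cipher, SecretKey key, byte[] iv,
                                    int counter, byte[] payload) throws Exception {
        // GCM rejects reusing a key+IV pair for encryption, so derive a
        // fresh IV from the running counter on every call.
        iv[0] = (byte) counter;
        iv[1] = (byte) (counter >>> 8);
        iv[2] = (byte) (counter >>> 16);
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        cipher.doFinal(payload);
    }
}

Comparing that figure with your brokers' actual network throughput gives a feel for how much of the CPU budget bulk encryption could plausibly consume.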
