How to find the root cause of high CPU usage of Kafka brokers?

Problem description

I am in charge of operating two Kafka clusters (one for prod and one for our dev environment). The setups are mostly similar, but the dev environment has no SASL/SSL and uses just 4 brokers instead of 8. Each broker is assigned to a dedicated Google Kubernetes node with 4 vCPUs and 26 GB RAM.

On our dev environment we get roughly 1000 messages in / sec, and each of the 4 brokers pretty consistently uses 3 of the 4 available CPU cores (75% CPU usage).

On our prod environment we get about 1500 messages in / sec, and CPU usage there is also 3 out of 4 cores.

CPU usage seems to be at least one of our bottlenecks, and I'd like to know how to profile the CPU so that I know exactly what is causing the high usage. Since it's relatively consistent, I guess it could be our Snappy compression.

I am interested in any ideas on how to investigate the cause of the high CPU usage and how I could tune my cluster to reduce it.

  • Apache Kafka version: 2.1 (CPU load was similar on Kafka 0.11.x too)

  • Dev cluster (Snappy compression, no SASL/SSL, 4 brokers): 1000 messages in / sec, consistently 3 CPU cores in use

  • Prod cluster (Snappy compression, SASL/SSL, 8 brokers): 1500 messages in / sec, consistently 3 CPU cores in use
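To put those figures in perspective, the following sketch works out the per-broker load from the numbers above. It assumes traffic is spread evenly across brokers and ignores replication traffic and message size, neither of which is given in the question.

```python
# Figures taken from the question; even distribution across brokers is an assumption.
clusters = {
    "dev": {"brokers": 4, "messages_per_sec": 1000, "cores_used": 3},
    "prod": {"brokers": 8, "messages_per_sec": 1500, "cores_used": 3},
}


def per_broker_load(brokers, messages_per_sec, cores_used):
    """Return (messages/sec per broker, messages/sec per busy core)."""
    per_broker = messages_per_sec / brokers
    per_core = per_broker / cores_used
    return per_broker, per_core


for name, cluster in clusters.items():
    per_broker, per_core = per_broker_load(**cluster)
    print(f"{name}: {per_broker:.1f} msg/s per broker, {per_core:.1f} msg/s per busy core")
```

Under these assumptions each broker handles only a few hundred messages per second, which may hint that the CPU time is going to something other than raw message volume.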

Side note: I already made sure that producers send their messages Snappy-compressed. I have access to all JMX metrics, but couldn't find anything useful for figuring out the CPU usage.

I already have the metrics in Prometheus (this is also where I got the CPU usage stats from). The problem is that the container's CPU usage doesn't tell me WHY it is that high. I need more granularity, e.g. what the CPU cycles are being spent on (compression? broker communication? SASL/SSL?).
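One way to get that granularity is to break CPU time down by broker thread group rather than by container. The sketch below assumes you have already sampled cumulative CPU seconds per thread twice (for example from `top -H` or the JVM's `java.lang:type=Threading` MBean). The thread names and numbers are made-up sample data, and `group_of` and `cpu_by_group` are hypothetical helpers, not part of Kafka.

```python
import re
from collections import defaultdict


def group_of(thread_name):
    """Collapse a thread name to its pool name by cutting at the first numeric index."""
    return re.split(r"-\d+", thread_name, maxsplit=1)[0]


def cpu_by_group(before, after, interval_sec):
    """Return {group: average cores used} between two samples of cumulative CPU seconds."""
    cores = defaultdict(float)
    for name, cpu_after in after.items():
        if name not in before:
            continue  # thread started between samples; no baseline to diff against
        cores[group_of(name)] += (cpu_after - before[name]) / interval_sec
    return dict(cores)


# Hypothetical samples taken 10 seconds apart (cumulative CPU seconds per thread).
before = {"kafka-network-thread-1-0": 100.0, "kafka-network-thread-1-1": 90.0,
          "kafka-request-handler-0": 200.0, "ReplicaFetcherThread-0-2": 50.0}
after = {"kafka-network-thread-1-0": 104.0, "kafka-network-thread-1-1": 93.0,
         "kafka-request-handler-0": 212.0, "ReplicaFetcherThread-0-2": 51.0}

usage = cpu_by_group(before, after, 10.0)
for group, used in sorted(usage.items(), key=lambda kv: -kv[1]):
    print(f"{group}: {used:.2f} cores")
```

A breakdown like this only narrows the problem to a thread group; for code-level detail inside a group, a JVM sampling profiler attached to the broker would be the likely next step.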

Solution

If you have access to the JMX metrics, you are almost set up for CPU profiling. All you have to do is install Prometheus and Grafana, store the metrics in Prometheus, and monitor them with Grafana. You can find the complete steps in Monitoring Kafka.
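As a rough sketch of that setup in use, the snippet below builds an instant query against Prometheus's HTTP API (`/api/v1/query`) and reads the per-broker values out of the JSON response. The Prometheus address and the metric name are assumptions: the name depends on how your JMX exporter rules rename Kafka's `RequestHandlerAvgIdlePercent` MBean, so substitute whatever your exporter actually exposes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # assumption: adjust to your Prometheus address
# Hypothetical metric name; the real one depends on your JMX exporter configuration.
IDLE_METRIC = "kafka_server_request_handler_avg_idle_percent"


def build_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload):
    """Map each series' 'instance' label to its float value."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    return {series["metric"].get("instance", "unknown"): float(series["value"][1])
            for series in payload["data"]["result"]}


def fetch_idle_ratio():
    """Run the query over HTTP (needs a reachable Prometheus server)."""
    with urlopen(build_query_url(PROMETHEUS_URL, IDLE_METRIC), timeout=10) as resp:
        return parse_instant_vector(json.load(resp))


# Sample response in the shape the HTTP API returns, with made-up values.
sample = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"instance": "broker-0:7071"}, "value": [1700000000, "0.35"]},
    {"metric": {"instance": "broker-1:7071"}, "value": [1700000000, "0.80"]},
]}}
print(parse_instant_vector(sample))
```

A low idle ratio on the request handler threads would point at request processing; Kafka exposes a comparable idle metric for the network threads (`NetworkProcessorAvgIdlePercent`).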

Note: If you suspect Snappy compression, this performance test may help you.

Update:

According to Confluent, most of the CPU usage is due to SSL:

Note that if SSL is enabled, the CPU requirements can be significantly higher (the exact details depend on the CPU type and JVM implementation).

You should choose a modern processor with multiple cores. Common clusters utilize 24 core machines.

If you need to choose between faster CPUs or more cores, choose more cores. The extra concurrency that multiple cores offers will far outweigh a slightly faster clock speed.
